My latest contribution for scikit-learn is
an implementation of the isotonic regression model that I coded with
Nelle Varoquaux and
Alexandre Gramfort. This model
finds the best least squares fit to a set of points, given the
constraint that the fit must be a non-decreasing
function. The example
on the scikit-learn website gives an intuition on this model.

The original points are in red, and the estimated ones are in
green. As you can see, there is one estimation (green point) for each
data sample (red point). Calling $y \in \mathbb{R}^n$ the input data,
the model can be written concisely as an optimization problem over $x$

$$
\text{argmin}_x |y - x |^2 \\
\text{subject to } x_0 \leq x_1 \leq \cdots \leq x_n
$$

The algorithm implemented in scikit-learn ^{3} is the pool adjacent
violators algorithm ^{1}, which is an efficient linear time
$\mathcal{O}(n)$ algorithm. The algorithm sweeps through the data
looking for violations of the monotonicity constraint. When it finds
one, it adjusts the estimate to the best possible fit with
constraints. Sometimes it also needs to modify previous points to make
sure the new estimate does not violate the constraints. The following
picture shows how it proceeds at each iteration

Besides performing a line-by-line analysis of memory consumption,
`memory_profiler`

exposes some functions that allow to retrieve the memory consumption
of a function in real-time, allowing e.g. to visualize the memory
consumption of a given function over time.

The function to be used is `memory_usage`

. The first argument
specifies what code is to be monitored. This can represent either an
external process or a Python function. In the case of an external
process the first argument is an integer representing its process
identifier (PID). In the case of a Python function, we need pass the
function and its arguments to memory_usage. We do this by passing the
tuple `(f, args, kw)`

that specifies the function, its position
arguments as a tuple and its keyword arguments as a dictionary,
respectively. This will be then executed by `memory_usage`

as
`f(*args, **kw)`

.

Let's see this with an example. Take as function NumPy's
pseudo-inverse function. Thus
`f = numpy.linalg.pinv`

and `f`

takes one positional argument (the
matrix to be inverted) so `args = (a,)`

where `a`

is the matrix to be
inverted. Note that args must be a tuple consisting of the different
arguments, thus the parenthesis around `a`

. The third item is a
dictionary `kw`

specifying the keyword arguments. Here kw is optional
and is omitted.

>>> from memory_profiler import memory_usage
>>> import numpy as np
# create a random matrix
>>> a = np.random.randn(500, 500)
>>> mem_usage = memory_usage((np.linalg.pinv, (a,)), interval=.01)
>>> print(mem_usage)
[57.02734375, 55.0234375, 57.078125, ...]

This has given me a list specifying at different time intervals ```
(t0,
t0 + .01, t0 + .02, ...)
```

at which the measurements where taken. Now I can
use that to for example plot the memory consumption as a function of
time:

>>> import pylab as pl
>>> pl.plot(np.arange(len(mem_usage)) * .01, mem_usage, label='linalg.pinv')
>>> pl.xlabel('Time (in seconds)')
>>> pl.ylabel('Memory consumption (in MB)')
>>> pl.show()

This will give the memory usage of a single function across time, which
might be interesting for example to detect temporaries that would be
created during the execution.

Another use case for `memory_usage`

would be to see how memory behaves
as input data gets bigger. In this case we are interested in memory as
a function of the input data. One obvious way we can do this is by
calling the same function each time with a different input and take as
memory consumption the maximum consumption over time. This way we will
have a memory usage for each input.

>>> for i in range(1, 5):
... A = np.random.randn(100 * i, 100 * i)
... mem_usage = memory_usage((np.linalg.pinv, (A,)))
... print max(mem_usage)
29.22
30.10
40.66
53.96

It is now possible to plot these results as a function of the
dimensions.

import numpy as np
import pylab as pl
from memory_profiler import memory_usage
dims = np.linspace(100, 1000, 10)
pinv_mem = np.zeros(dims.size)
for i_dim, k in enumerate(dims):
x = np.random.randn(k, k)
tmp = memory_usage((np.linalg.pinv, (x,)), interval=.01)
pinv_mem[i_dim] = np.max(tmp)
pl.plot(dims, pinv_mem, label='np.linalg.pinv')
pl.ylabel('Memory (in MB)')
pl.xlabel('Dimension of the square matrix')
pl.legend(loc='upper left')
pl.axis('tight')
pl.show()

SciPy contains two methods to compute the singular value decomposition (SVD) of a matrix: `scipy.linalg.svd`

and `scipy.sparse.linalg.svds`

. In this post I'll compare both methods for the task of computing the full SVD of a large dense matrix.

The first method, `scipy.linalg.svd`

, is perhaps the best known and uses the linear algebra library LAPACK to handle the computations. This implements the Golub-Kahan-Reisch algorithm ^{1}, which is accurate and highly efficient with a cost of O(n^3) floating-point operations ^{2}.

The second method is `scipy.sparse.linalg.svds`

and despite it's name it works fine also for dense arrays. This implementation is based on the ARPACK library and consists of an iterative procedure that finds the SVD decomposition by reducing the problem to an eigendecomposition on the array X -> dot(X.T, X). This method is usually very effective when the input matrix X is sparse or only the largest singular values are required. There are other SVD solvers that I did not consider, such as sparsesvd or pysparse.jdsym, but my points for the sparse solve probably hold for those packages too since they both implement iterative algorithms based on the same principles.

When the input matrix is dense and all the singular values are required, the first method is usually more efficient. To support this statement I've created a little benchmark: timings for both methods as a function of the size of the matrices. Notice that we are in a case that is clearly favorable to the `linalg.svd`

: after all `sparse.linalg.svds`

was not created with this setting in mind, it was created for sparse matrices or dense matrices with some special structure. We will see however that even in this setting it has interesting advantages.

I'll create random square matrices with different sizes and plot the timings for both methods. For the benchmarks I used SciPy v0.12 linked against Intel Math Kernel Library v11. Both methods are single-threaded (I had to set OMP_NUM_THREADS=1 so that MKL does not try to parallelize the computations). [code]

Lower timings are better, so this gives `scipy.linalg.svd`

as clear winner. However, this is just part of the story. What this graph doesn't show is that this method is winning at the price of allocating a huge amount of memory for temporary computations. If we now plot the memory consumption for both methods under the same settings, the story is completely different. [code]

The memory requirements of `scipy.linalg.svd`

scale with the number of dimensions, while for the sparse version the amount of allocated memory is constant. Notice that we are measuring the amount of total memory used, it is thus natural to see a slight increase in memory consumption since the input matrix is bigger on each iteration.

For example, in my applications, I need to compute the SVD of a matrix for whom the needed workspace does not fit in memory. In cases like this, the sparse algorithm (`sparse.linalg.svds`

) can come in handy: the timing is just a factor worse (but I can easily parallelize jobs) and the memory requirements for this method is peanuts compared to the dense version.

This tutorial introduces the concept of pairwise preference used in most ranking problems. I'll use scikit-learn and for learning and matplotlib for visualization.

In the ranking setting, training data consists of lists of items with some order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item, so that for any two samples `a`

and `b`

, either `a < b`

, `b > a`

or `b`

and `a`

are not comparable.

For example, in the case of a search engine, our dataset consists of results that belong to different queries and we would like to only compare the relevance for results coming from the same query.

This order relation is usually domain-specific. For instance, in information retrieval the set of comparable samples is referred to as a "query id". The goal behind this is to compare only documents that belong to the same query (Joachims 2002). In medical imaging on the other hand, the order of the labels usually depend on the subject so the comparable samples is given by the different subjects in the study (Pedregosa et al 2012).

import itertools
import numpy as np
from scipy import stats
import pylab as pl
from sklearn import svm, linear_model, cross_validation

To start with, we'll create a dataset in which the target values consists of three graded measurements Y = {0, 1, 2} and the input data is a collection of 30 samples, each one with two features.

The set of comparable elements (queries in information retrieval) will consist of two equally sized blocks, $X = X_1 \cup X_2$, where each block is generated using a normal distribution with different mean and covariance. In the pictures, we represent $X_1$ with round markers and $X_2$ with triangular markers.

np.random.seed(0)
theta = np.deg2rad(60)
w = np.array([np.sin(theta), np.cos(theta)])
K = 20
X = np.random.randn(K, 2)
y = [0] * K
for i in range(1, 3):
X = np.concatenate((X, np.random.randn(K, 2) + i * 4 * w))
y = np.concatenate((y, [i] * K))
# slightly displace data corresponding to our second partition
X[::2] -= np.array([3, 7])
blocks = np.array([0, 1] * (X.shape[0] / 2))
# split into train and test set
cv = cross_validation.StratifiedShuffleSplit(y, test_size=.5)
train, test = iter(cv).next()
X_train, y_train, b_train = X[train], y[train], blocks[train]
X_test, y_test, b_test = X[test], y[test], blocks[test]
# plot the result
idx = (b_train == 0)
pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
marker='^', cmap=pl.cm.Blues, s=100)
pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
marker='o', cmap=pl.cm.Blues, s=100)
pl.arrow(0, 0, 8 * w[0], 8 * w[1], fc='gray', ec='gray',
head_width=0.5, head_length=0.5)
pl.text(0, 1, '$w$', fontsize=20)
pl.arrow(-3, -8, 8 * w[0], 8 * w[1], fc='gray', ec='gray',
head_width=0.5, head_length=0.5)
pl.text(-2.6, -7, '$w$', fontsize=20)
pl.axis('equal')
pl.show()

In the plot we clearly see that for both blocks there's a common vector w such that the projection onto w gives a list with the correct ordering.

However, because linear considers that output labels live in a metric space it will consider that all pairs are comparable. Thus if we fit this model to the problem above it will fit both blocks at the same time, yielding a result that is clearly not optimal. In the following plot we estimate $\hat{w}$ using an l2-regularized linear model.

ridge = linear_model.Ridge(1.)
ridge.fit(X_train, y_train)
coef = ridge.coef_ / linalg.norm(ridge.coef_)
pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
marker='^', cmap=pl.cm.Blues, s=100)
pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
marker='o', cmap=pl.cm.Blues, s=100)
pl.arrow(0, 0, 7 * coef[0], 7 * coef[1], fc='gray', ec='gray',
head_width=0.5, head_length=0.5)
pl.text(2, 0, '$\hat{w}$', fontsize=20)
pl.axis('equal')
pl.title('Estimation by Ridge regression')
pl.show()

To assess the quality of our model we need to define a ranking score. Since we are interesting in a model that *orders* the data, it is natural to look for a metric that compares the ordering of our model to the given ordering. For this, we use Kendall's tau correlation coefficient, which is defined as (P - Q)/(P + Q), being P the number of concordant pairs and Q is the number of discordant pairs. This measure is used extensively in the ranking literature (e.g Optimizing Search Engines using Clickthrough Data).

We thus evaluate this metric on the test set for each block separately.

for i in range(2):
tau, _ = stats.kendalltau(
ridge.predict(X_test[b_test == i]), y_test[b_test == i])
print('Kendall correlation coefficient for block %s: %.5f' % (i, tau))

Kendall correlation coefficient for block 0: 0.71122
Kendall correlation coefficient for block 1: 0.84387

## The pairwise transform

As proved in (Herbrich 1999), if we consider linear ranking functions, the ranking problem can be transformed into a two-class classification problem. For this, we form the difference of all comparable elements such that our data is transformed into $(x'_k, y'_k) = (x_i - x_j, sign(y_i - y_j))$ for all comparable pairs.

This way we transformed our ranking problem into a two-class classification problem. The following plot shows this transformed dataset, and color reflects the difference in labels, and our task is to separate positive samples from negative ones. The hyperplane {x^T w = 0} separates these two classes.

# form all pairwise combinations
comb = itertools.combinations(range(X_train.shape[0]), 2)
k = 0
Xp, yp, diff = [], [], []
for (i, j) in comb:
if y_train[i] == y_train[j] \
or blocks[train][i] != blocks[train][j]:
# skip if same target or different group
continue
Xp.append(X_train[i] - X_train[j])
diff.append(y_train[i] - y_train[j])
yp.append(np.sign(diff[-1]))
# output balanced classes
if yp[-1] != (-1) ** k:
yp[-1] *= -1
Xp[-1] *= -1
diff[-1] *= -1
k += 1
Xp, yp, diff = map(np.asanyarray, (Xp, yp, diff))
pl.scatter(Xp[:, 0], Xp[:, 1], c=diff, s=60, marker='o', cmap=pl.cm.Blues)
x_space = np.linspace(-10, 10)
pl.plot(x_space * w[1], - x_space * w[0], color='gray')
pl.text(3, -4, '$\{x^T w = 0\}$', fontsize=17)
pl.axis('equal')
pl.show()

As we see in the previous plot, this classification is separable. This will not always be the case, however, in our training set there are no order inversions, thus the respective classification problem is separable.

We will now finally train an Support Vector Machine model on the transformed data.
This model is known as RankSVM, although we note that the pairwise transform is more general and can be used together with any linear model. We will then plot the training data together with the estimated coefficient $\hat{w}$ by RankSVM.

clf = svm.SVC(kernel='linear', C=.1)
clf.fit(Xp, yp)
coef = clf.coef_.ravel() / linalg.norm(clf.coef_)
pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
marker='^', cmap=pl.cm.Blues, s=100)
pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
marker='o', cmap=pl.cm.Blues, s=100)
pl.arrow(0, 0, 7 * coef[0], 7 * coef[1], fc='gray', ec='gray',
head_width=0.5, head_length=0.5)
pl.arrow(-3, -8, 7 * coef[0], 7 * coef[1], fc='gray', ec='gray',
head_width=0.5, head_length=0.5)
pl.text(1, .7, '$\hat{w}$', fontsize=20)
pl.text(-2.6, -7, '$\hat{w}$', fontsize=20)
pl.axis('equal')
pl.show()

Finally we will check that as expected, the ranking score (Kendall tau) increases with the RankSVM model respect to linear regression.

for i in range(2):
tau, _ = stats.kendalltau(
np.dot(X_test[b_test == i], coef), y_test[b_test == i])
print('Kendall correlation coefficient for block %s: %.5f' % (i, tau))

Kendall correlation coefficient for block 0: 0.83627
Kendall correlation coefficient for block 1: 0.84387

This is indeed higher than the values (0.71122, 0.84387) obtained in the case of linear regression.

**Original ipython notebook for this blog post can be found here**

My newest project is a Python library for monitoring memory consumption
of arbitrary process, and one of its most useful features is the
line-by-line analysis of memory usage for Python code. I wrote a basic
prototype six months ago after being surprised by the lack of related
tools. I wanted to plot memory consumption of a couple of Python
functions but did not find a python module to do the job. I came to the
conclusion that there is no standard way to get the memory usage of the
Python interpreter from within Python, so I resorted to reading for from
`/proc/$PID/statm`. From there on I realized that one the fetching of
memory is done, making a line-by-line report wouldn't be hard. Back to
today. I've been using the line-by-line memory monitoring to diagnose
poor memory management (hidden temporaries, unused allocation, etc.) for
some time. It seems to work on two different computers, so full of
confidence as I am, I'll write a blog post about it ...

## How to use it?

The easiest way to get it is to install from the Python Package Index:

$ easy_install -U memory_profiler # pip install -U memory_profiler

but other options include fetching the latests
from github or dropping it on your current working directory or
somewhere else on your PYTHONPATH since it consist of a single file.
Then next step is to write some python code to profile. It can be just
about any function, but for the purpose of this blog post I'll create a
function ``my_func()`` with mostly memory allocations and save it to a file
named example.py:

@profile
def my_func():
a = [1] * (10 ** 6)
b = [2] * (2 * 10 ** 7)
del b
return a
if __name__ == '__main__':
my_func()

Note that I've decorated the function
with @profile. This tells the profiler to look into function my_func
and gather the memory consumption for each line.

## Wake up the cookie monster

To start profiling and output the result to stdout, run the script as
usual and append the options -m memory_profiler -l -v to the python
interpreter:

$ python -m memory_profiler example.py
Filename: example.py
Line # Mem usage Increment Line Contents
================================================
2 @profile
3 8.00 MB 0.00 MB def my_func():
4 15.00 MB 7.00 MB a = [1] * (10 ** 6)
5 168.00 MB 153.00 MB b = [2] * (2 * 10 ** 7)
6 15.00 MB -153.00 MB del b
7 15.00 MB 0.00 MB return a

voilá! Each line is prefixed by the memory usage in
MB of the Python interpreter after that line has been executed.

In scipy's development version there's a new function closely related to
the QR-decomposition of a matrix and to the least-squares solution of
a linear system. What this function does is to compute the
QR-decomposition of a matrix and then multiply the resulting orthogonal
factor by another arbitrary matrix. In pseudocode:

def qr_multiply(X, Y):
Q, R = qr(X)
return dot(Q.T, Y)

but unlike this naive implementation, `qr_multiply` is able to do
all this **without** explicitly computing the orthogonal Q matrix,
resulting both in memory and time saving. In the following picture I
measured the memory consumption as a function of time of running this
computation on a 1.000 x 1.000 matrix X and a vector Y (full code can
be found here):

It can be seen that not only `qr_multiply` is
almost twice as fast as the naive approach, but also that the memory
consumption is significantly reduced, since the orthogonal factor is
never explicitly computed. Credit for implementing the qr_multiply
function goes to Martin Teichmann.