My latest research paper[^{1}] deals with the estimation of the hemodynamic response function (HRF) from fMRI data.

This is an important topic since the knowledge of a hemodynamic response function is what makes it possible to extract the brain activation maps that are used in most of the impressive applications of machine learning to fMRI, such as (but not limited to) the reconstruction of visual images from brain activity [^{2}] [^{3}] or the decoding of numbers [^{4}].

Besides the more traditional paper that describes the method, I've put online the code I used for the experiments. The code at this stage is far from perfect but it should help in reproducing the results or improving the method. I've also put online an IPython notebook with the analysis of a small piece of data. I'm, of course, glad to receive feedback/bug reports/patches for this code.

One of the lesser known features of the memory_profiler package is its ability to plot memory consumption as a function of time. This was implemented by my friend Philippe Gervais, previously a colleague at INRIA and now at Google.

With this feature it is possible to very easily generate a plot of the memory consumption as a function of time. The result will be something like this:

where you can see the memory used (on the y-axis) as a function of time (x-axis). Furthermore, we have used two functions, test1 and test2, and the colored brackets show at what time these functions start and finish.

This plot was generated with the following simple script:
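The original script isn't reproduced here, but a minimal version consistent with the description below might look like this (the list sizes and the try/except fallback, which also makes the script runnable outside of mprof, are my own choices):

```python
import time

try:
    profile            # injected automatically when run via "mprof run"
except NameError:
    def profile(func): # no-op fallback so the script also runs standalone
        return func

@profile
def test1():
    n = 10000
    a = [1] * n
    time.sleep(1)      # give the sampler time to take measurements
    return a

@profile
def test2():
    n = 100000
    b = [2] * n        # a bigger list than in test1
    time.sleep(1)
    return b

if __name__ == '__main__':
    test1()
    test2()
```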

What happens here is that we have two functions, test1() and test2(), in which we create two lists of different sizes (the one in test2 is bigger). We call time.sleep(1) so that the functions do not return too soon, giving us time to get reliable memory measurements.

The decorator @profile is optional: it lets memory_profiler know when the function has been called so it can plot the brackets indicating that. If you don't use the decorator, the example will work just fine except that the brackets will not appear in your plot.

Suppose we have saved the script as test1.py. We run the script as

$ mprof run test1.py

where mprof is an executable provided by memory_profiler. If the command is successful, it will print something like this:

$ mprof run test1.py
mprof: Sampling memory every 0.1s
running as a Python program...

The above command will create a .dat file in your current working directory, something like mprofile_20141108113511.dat. This file (you can inspect it, it's a text file) contains the memory measurements for your program.

You can now plot the memory measurements with the command

$ mprof plot

This will open a matplotlib window and show you the plot:

As you see, attention has been paid to the default values so that the plot already looks decent without much effort. The not-so-nice part is that, at least as of November 2014, if you want to customize the plot you'll have to look at and modify the mprof script itself. Some refactoring is still needed to make the plots easier to customize (work in progress).

TL;DR These are some notes on the calibration of surrogate loss functions in the context of machine learning. But mostly it is an excuse to post some images I made.

In the binary classification setting we are given $n$ training samples $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where $X_i$ belongs to some sample space $\mathcal{X}$, usually $\mathbb{R}^p$, but for the purpose of this post we can keep it abstract, and $Y_i \in \{-1, 1\}$ is an integer representing the class label.

We are also given a loss function $\ell: \{-1, 1\} \times \{-1, 1\} \to \mathbb{R}$ that measures the error of a given prediction. The value of the loss function $\ell$ at an arbitrary point $(y, \hat{y})$ is interpreted as the cost incurred by predicting $\hat{y}$ when the true label is $y$. In classification this function is often the zero-one loss, that is, $\ell(y, \hat{y})$ is zero when $y = \hat{y}$ and one otherwise.

The goal is to find a function $h: \mathcal{X} \to \{-1, 1\}$, the classifier, with the smallest expected loss on a new sample. In other words, we seek a function $h$ that minimizes the expected $\ell$-risk, given by
$$
\mathcal{R}_{\ell}(h) = \mathbb{E}_{X \times Y}[\ell(Y, h(X))]
$$

In theory, we could directly minimize the $\ell$-risk and we would have the optimal classifier, also known as the Bayes predictor. However, there are several problems associated with this approach. One is that the probability distribution of $X \times Y$ is unknown, thus computing the exact expected value is not feasible; it must be approximated by the empirical risk. Another issue is that this quantity is difficult to optimize because the function $\ell$ is discontinuous. Take for example a problem in which $\mathcal{X} = \mathbb{R}^2$, $k=2$, and we seek the linear function $f(X) = \text{sign}(X w)$, $w \in \mathbb{R}^2$, that minimizes the $\ell$-risk. As a function of the parameter $w$, the risk looks something like

This function is discontinuous with large, flat regions and is thus extremely hard to optimize using gradient-based methods. For this reason it is usual to consider a proxy to the loss called a surrogate loss function. For computational reasons this is usually a convex function $\Psi: \mathbb{R} \to \mathbb{R}_+$. An example of such a surrogate loss function is the hinge loss, $\Psi(t) = \max(1-t, 0)$, which is the loss used by Support Vector Machines (SVMs). Another example is the logistic loss, $\Psi(t) = \log(1 + \exp(-t))$, used by the logistic regression model. If we consider the logistic loss, minimizing the $\Psi$-risk, given by $\mathbb{E}_{X \times Y}[\Psi(Y f(X))]$, of the function $f(X) = X w$ becomes a much more tractable optimization problem:

In short, we have replaced the $\ell$-risk, which is computationally difficult to optimize, with the $\Psi$-risk, which has more advantageous properties. A natural question to ask is how much we have lost by this change. The property of whether minimizing the $\Psi$-risk leads to a function that also minimizes the $\ell$-risk is often referred to as consistency or calibration. For a more formal definition see [^{1}] and [^{2}]. This property depends on the surrogate function $\Psi$: some functions $\Psi$ satisfy the consistency property and some do not. One of the most useful characterizations was given in [^{1}] and states that if $\Psi$ is convex then it is consistent if and only if it is differentiable at zero and $\Psi'(0) < 0$. This includes most of the commonly used surrogate loss functions, including the hinge, logistic and Huber loss functions.
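To make this concrete, here is a small sketch (function names are mine) that writes the three losses mentioned above as functions of the margin $t = y f(x)$ and numerically checks the calibration condition for the logistic loss:

```python
import numpy as np

# the three losses, written as functions of the margin t = y * f(x)
def zero_one(t):
    return (t <= 0).astype(float)

def hinge(t):
    return np.maximum(1 - t, 0)

def logistic(t):
    return np.log(1 + np.exp(-t))

# calibration condition for a convex surrogate: differentiable at zero
# with negative derivative; check numerically for the logistic loss
eps = 1e-6
slope_at_zero = (logistic(np.array([eps]))[0]
                 - logistic(np.array([-eps]))[0]) / (2 * eps)
# slope_at_zero is approximately -0.5, which is < 0, so the condition holds
```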

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, "Convexity, Classification, and Risk Bounds," J. Am. Stat. Assoc., pp. 1–36, 2003.

A. Tewari and P. L. Bartlett, "On the Consistency of Multiclass Classification Methods," J. Mach. Learn. Res., vol. 8, pp. 1007–1025, 2007.

As part of the development of
memory_profiler I've tried
several ways to get memory usage of a program from within Python. In this post
I'll describe the different alternatives I've tested.

The psutil library

psutil is a python library that provides
an interface for retrieving information on running processes. It provides
convenient, fast and cross-platform functions to access the memory usage of a
Python module:

import os
import psutil

def memory_usage_psutil():
    # return the memory usage in MB
    process = psutil.Process(os.getpid())
    mem = process.get_memory_info()[0] / float(2 ** 20)
    return mem

The above function returns the memory usage of the current Python process in
MiB. Depending on the platform it will choose the most accurate and fastest
way to get this information. For example, on Windows it will use the C++ Win32
API while on Linux it will read from /proc, hiding the implementation
details and providing a fast and accurate measurement on each platform.

If you are looking for an easy way to get the memory consumption within Python,
this is, in my opinion, your best bet.

The resource module

The resource module is part of the standard Python library. It's basically a
wrapper around getrusage, which is a POSIX standard, although some fields are
still missing on Linux. However, the ones we are interested in seem to work
fine on Ubuntu 10.04. You can get the memory usage with this function:

import sys
import resource

def memory_usage_resource():
    rusage_denom = 1024.
    if sys.platform == 'darwin':
        # ... it seems that in OSX the output is in different units ...
        rusage_denom = rusage_denom * rusage_denom
    mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / rusage_denom
    return mem

In my experience this approach is several times faster than the one based on
psutil, and it was the default way to get the memory usage in
memory_profiler from version 0.23 up to 0.26. I changed this behavior in
0.27 after a bug report by Philippe Gervais.
The problem with this approach is that it seems to report results that are
slightly different in some cases. Notably, it seems to differ when objects
have been recently freed by the Python interpreter.

In the following example, orphaned arrays are freed by the Python
interpreter, which is correctly seen by psutil but not by resource:
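The original snippet isn't included here, but the resource half of the effect can be sketched with just the standard library (sizes and names are mine): ru_maxrss tracks the peak usage, so it never decreases, even after the interpreter frees memory:

```python
import resource
import sys

def memory_usage_resource():
    # ru_maxrss is reported in kilobytes on Linux and in bytes on OS X
    denom = 1024.
    if sys.platform == 'darwin':
        denom *= denom
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / denom

before = memory_usage_resource()
big = [0] * (10 ** 7)             # allocate roughly 80 MB
peak = memory_usage_resource()
del big                           # the interpreter frees the list here ...
after = memory_usage_resource()   # ... but resource still reports the peak
```

A current-usage probe like psutil's would show the drop after the del; ru_maxrss will not.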

By the way, I would be delighted to be corrected if I'm doing something wrong,
or informed of a workaround if one exists (I've got the code to reproduce the
figures ^{1})

querying ps directly

The method based on psutil works great but is not available by default on all
Python systems. Because of this, in memory_profiler we use as a last resort
something that's pretty ugly but works reasonably well when all else fails:
invoking the system's ps command and parsing the output. The code is
something like:
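The snippet itself is not shown above; a sketch of what such a ps-based fallback could look like (the exact ps flags and the parsing are my guess, not necessarily what memory_profiler does):

```python
import os
import subprocess

def memory_usage_ps():
    # fork a ps process and parse the resident set size (RSS) column,
    # which ps reports in kilobytes; convert it to MB
    out = subprocess.check_output(
        ['ps', '-p', str(os.getpid()), '-o', 'rss='])
    return float(out.strip()) / 1024.
```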

The main disadvantage of this approach is that it needs to fork a process for
each measurement. For tasks where you need to get memory usage very fast,
such as line-by-line memory profiling, this can add a huge overhead to the code.
For other tasks, such as getting information on long-running processes, where
the measurement is anyway taken from a separate process, this is not too bad.

benchmarks

Here is a benchmark of the different alternatives presented above. I am
plotting the time it takes the different approaches to make 100 measurements
of the memory usage (lower is better). As can be seen, the fastest one is
resource (although it suffers from the issues described above), followed
closely by psutil, which is in my opinion the best option if you can count on
it being installed on the host system, and trailed from afar by ps, which is
roughly a hundred times slower than psutil.

IPython notebook to reproduce the figures: html, ipynb

In this post I compare several implementations of
Logistic Regression. The task was to implement a Logistic Regression model
using standard optimization tools from scipy.optimize and compare
them against state of the art implementations such as
LIBLINEAR.

In this blog post I'll write down all the implementation details of this
model, in the hope that not only the conclusions but also the process would be
useful for future comparisons and benchmarks.

Function evaluation

We consider the case in which the decision function is an affine function, i.e., $f(x) = \langle x, w \rangle + c$ where $w$ and $c$ are parameters to estimate. The loss function for $\ell_2$-regularized logistic regression, i.e., the
function to be minimized, is

$$
\mathcal{L}(w, \lambda, X, y) = - \frac{1}{n}\sum_{i=1}^n \log(\phi(y_i (\langle X_i, w \rangle + c))) + \frac{\lambda}{2} w^T w
$$

where $\phi(t) = 1 / (1 + \exp(-t))$ is the logistic
function, $\frac{\lambda}{2} w^T w$ is
the regularization term and $X, y$ is the input data, with $X \in
\mathbb{R}^{n \times p}$ and $y \in \{-1, 1\}^n$. However, this formulation is
not great from a practical standpoint. Even for values of $t$ that are not
unlikely, such as $t = -100$, $\exp(100)$ will overflow, assigning the loss an
(erroneous) value of $+\infty$. For this reason ^{1}, we evaluate
$\log(\phi(t))$ as
$\log(\phi(t))$ as

$$
\log(\phi(t)) =
\begin{cases}
- \log(1 + \exp(-t)) \text{ if } t > 0 \\
t - \log(1 + \exp(t)) \text{ if } t \leq 0\\
\end{cases}
$$

Similarly, the logistic function $\phi$ used here can be computed in a more
stable way using the formula

$$
\phi(t) = \begin{cases}
1 / (1 + \exp(-t)) \text{ if } t > 0 \\
\exp(t) / (1 + \exp(t)) \text{ if } t \leq 0\\
\end{cases}
$$

Finally, we will also need the Hessian for some second-order methods, which is given by

$$
\nabla_w ^2 \mathcal{L} = \frac{1}{n} X^T D X + \lambda I
$$

where $I$ is the identity matrix and $D$ is a diagonal matrix given by $D_{ii} = \phi(y_i w^T X_i)(1 - \phi(y_i w^T X_i))$.

In Python, these functions can be written as

import numpy as np

def phi(t):
    # logistic function, returns 1 / (1 + exp(-t))
    idx = t > 0
    out = np.empty(t.size, dtype=np.float64)
    out[idx] = 1. / (1 + np.exp(-t[idx]))
    exp_t = np.exp(t[~idx])
    out[~idx] = exp_t / (1. + exp_t)
    return out

def loss(x0, X, y, alpha):
    # logistic loss function, returns Sum{-log(phi(t))}
    w, c = x0[:X.shape[1]], x0[-1]
    z = X.dot(w) + c
    yz = y * z
    idx = yz > 0
    out = np.zeros_like(yz)
    out[idx] = np.log(1 + np.exp(-yz[idx]))
    out[~idx] = (-yz[~idx] + np.log(1 + np.exp(yz[~idx])))
    out = out.sum() / X.shape[0] + .5 * alpha * w.dot(w)
    return out

def gradient(x0, X, y, alpha):
    # gradient of the logistic loss
    w, c = x0[:X.shape[1]], x0[-1]
    z = X.dot(w) + c
    z = phi(y * z)
    z0 = (z - 1) * y
    grad_w = X.T.dot(z0) / X.shape[0] + alpha * w
    grad_c = z0.sum() / X.shape[0]
    return np.concatenate((grad_w, [grad_c]))
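The snippet above stops at the gradient. For the second-order methods benchmarked below a Hessian is also needed; a possible sketch (the function name is mine, and I include the $1/n$ factor so that it matches the averaged loss used in the code):

```python
import numpy as np

def phi(t):
    # logistic function, repeated here so the snippet is self-contained
    idx = t > 0
    out = np.empty(t.size, dtype=np.float64)
    out[idx] = 1. / (1 + np.exp(-t[idx]))
    exp_t = np.exp(t[~idx])
    out[~idx] = exp_t / (1. + exp_t)
    return out

def hessian(x0, X, y, alpha):
    # Hessian of the loss with respect to w: (1/n) X^T D X + alpha I,
    # where D_ii = phi(y_i (X_i w + c)) * (1 - phi(y_i (X_i w + c)))
    w, c = x0[:X.shape[1]], x0[-1]
    z = phi(y * (X.dot(w) + c))
    d = z * (1 - z)
    return X.T.dot(d[:, None] * X) / X.shape[0] + alpha * np.eye(X.shape[1])
```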

Benchmarks

I tried several methods to estimate this $\ell_2$-regularized logistic regression model. There is
one first-order method (that is, one that only makes use of the gradient and not of
the Hessian), Conjugate Gradient, whereas all the others are quasi-Newton methods. The methods I tested are:

CG = Conjugate Gradient as implemented in scipy.optimize.fmin_cg

TNC = Truncated Newton as implemented in scipy.optimize.fmin_tnc

BFGS = Broyden–Fletcher–Goldfarb–Shanno method, as implemented in scipy.optimize.fmin_bfgs.

L-BFGS = Limited-memory BFGS as implemented in scipy.optimize.fmin_l_bfgs_b. Contrary to the BFGS algorithm, which is written in Python, this one wraps a C implementation.

Trust Region = Trust Region Newton method ^{1}. This is the solver used by LIBLINEAR, which I've wrapped to accept any Python function in the package pytron.

To ensure the most accurate results across implementations, all timings were
collected by callback functions called from the algorithm on each
iteration. Finally, I plot the maximum absolute value of the gradient (i.e., the
infinity norm of the gradient) with respect to time.

The synthetic data used in the benchmarks was generated as described in ^{2}: the
design matrix $X$ is Gaussian noise, the vector of coefficients is also drawn
from a Gaussian distribution, and the explained variable $y$ is generated as
$y = \text{sign}(X w)$. We then perturb the matrix $X$ by adding Gaussian noise
with covariance 0.8. The number of samples and features was fixed to $10^4$
and $10^3$, respectively. The penalization parameter $\lambda$ was fixed to 1.
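Under my reading of that description, the data generation can be sketched as follows (the random seed and the use of RandomState are my own choices):

```python
import numpy as np

n_samples, n_features = 10 ** 4, 10 ** 3
rng = np.random.RandomState(0)

X = rng.randn(n_samples, n_features)          # Gaussian design matrix
w = rng.randn(n_features)                     # Gaussian coefficient vector
y = np.sign(X.dot(w))                         # explained variable
X += 0.8 * rng.randn(n_samples, n_features)   # perturb the design
lambda_ = 1.                                  # penalization parameter
```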

In this setting variables are typically uncorrelated and most solvers perform
decently:

Here, the Trust Region and L-BFGS solvers perform almost equally well, with
Conjugate Gradient and Truncated Newton falling slightly behind. I was surprised
by the difference between BFGS and L-BFGS; I would have thought that when memory is not an issue both algorithms should perform similarly.

To make things more interesting, we now make the design slightly more
correlated. We do so by adding a constant term of 1 to the matrix $X$ and
also appending a column vector of ones to this matrix to account for the intercept. These are the results:

Here, we already see that second-order methods dominate over first-order
methods (well, except for BFGS), with Trust Region clearly dominating the
picture but with TNC not far behind.

Finally, if we force the matrix to be even more correlated (we add 10. to the
design matrix $X$), then we have:

Here, the Trust Region method has the same timing as before, but all other
methods have become substantially worse. The Trust Region
method, unlike the other methods, is surprisingly robust to correlated designs.

To sum up, the Trust Region method performs extremely well for optimizing the
Logistic Regression model under different conditionings of the design matrix.
The LIBLINEAR software uses
this solver and thus has similar performance, with the sole exception that the
evaluation of the logistic function and its derivatives is done in C++ instead
of Python. In practice, however, due to the small number of iterations of this
solver I haven't seen any significant difference.

A similar development can be found in the source code of LIBLINEAR, and is probably also used elsewhere.

"A comparison of numerical optimizers for logistic regression", T. P. Minka, URL

"Newton's Method for Large Bound-Constrained Optimization Problems", Chih-Jen Lin, Jorge J. Moré, URL

TL;DR: I've implemented a logistic ordinal regression or
proportional odds model. Here is the Python code

The logistic ordinal regression model, also known as the
proportional odds model, was introduced in the early 80s by McCullagh [^{1}, ^{2}]
and is a generalized linear model specially tailored for the case of
predicting ordinal variables, that is, variables that are discrete (as
in classification) but which can be ordered (as in regression). It can
be seen as an extension of the logistic regression model to the
ordinal setting.

We are given input data $X \in \mathbb{R}^{n \times p}$ and target
values $y \in \mathbb{N}^n$. For simplicity we assume $y$ is a
non-decreasing vector, that is, $y_1 \leq y_2 \leq ...$. Just as
logistic regression models the posterior probability $P(y=j|X_i)$ with the
logistic function, in logistic ordinal regression we model the
cumulative probability with the logistic function. That is,

$$
P(y \leq j | X_i) = \phi(\theta_j - w^T X_i) = \frac{1}{1 + \exp(w^T X_i - \theta_j)}
$$

where $w, \theta$ are vectors to be estimated from the data and $\phi$
is the logistic function defined as $\phi(t) = 1 / (1 + \exp(-t))$.

Compared to multiclass logistic regression, we have added the
constraint that the hyperplanes that separate the different classes are
parallel for all classes, that is, the vector $w$ is common across
classes. To decide which class $X_i$ will be assigned to, we make use
of the vector of thresholds $\theta$. If there are $K$ different
classes, $\theta$ is a non-decreasing vector (that is, $\theta_1 \leq
\theta_2 \leq ... \leq \theta_{K-1}$) of size $K-1$. We will then
assign the class $j$ if the prediction $w^T X$ (recall that it's a
linear model) lies in the interval $[\theta_{j-1}, \theta_{j})$. In
order to keep the same definition for extremal classes, we define
$\theta_{0} = - \infty$ and $\theta_K = + \infty$.

The intuition is that we are seeking a vector $w$ such that $X w$
produces a set of values that are well separated into the different
classes by the different thresholds $\theta$. We choose a logistic
function to model the probability $P(y \leq j|X_i)$ but other choices
are possible. In the proportional hazards model ^{1} the probability
is modeled as $-\log(1 - P(y \leq j | X_i)) = \exp(\theta_j - w^T
X_i)$. Other link functions are possible, where the link function
satisfies $\text{link}(P(y \leq j | X_i)) = \theta_j - w^T X_i$. Under
this framework, the logistic ordinal regression model has a logistic
link function and the proportional hazards model has a log-log link
function.

The logistic ordinal regression model is also known as the
proportional odds model, because the
ratio of corresponding odds
for two different samples $X_1$ and $X_2$ is $\exp(w^T(X_1 - X_2))$ and
so does not depend on the class $j$ but only on the difference between
the samples $X_1$ and $X_2$.

Optimization

Model estimation can be posed as an optimization problem. Here, we
minimize the loss function for the model, defined as minus the
log-likelihood:

$$
\mathcal{L}(w, \theta) = - \sum_{i=1}^n \log\left(\phi(\theta_{y_i} - w^T X_i) - \phi(\theta_{y_i - 1} - w^T X_i)\right)
$$

In this sum all terms are convex in $w$, thus the loss function is
convex over $w$. It might also be jointly convex over $w$ and
$\theta$, although I haven't checked. I use the function
fmin_slsqp in scipy.optimize to optimize
$\mathcal{L}$ under the constraint that $\theta$ is a non-decreasing
vector. There might be better options, I don't know. If you do know,
please leave a comment!

Using the formula $\log(\phi(t))^\prime = (1 - \phi(t))$, we can compute the gradient of the loss function as

I've implemented a Python version of this algorithm using SciPy's
optimize.fmin_slsqp function. This takes as arguments the
loss function, the gradient derived above and a function that is
> 0 when the inequalities on $\theta$ are satisfied.
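The actual minirank code isn't reproduced here; a minimal sketch of how fmin_slsqp can be wired up for this model might look as follows (all names and the initialization are my own choices):

```python
import numpy as np
from scipy.optimize import fmin_slsqp

def sigmoid(t):
    return 1. / (1 + np.exp(-t))

def ordinal_logistic_fit(X, y):
    # y takes integer values in 0 .. K-1
    n, p = X.shape
    K = int(y.max()) + 1

    def loss(params):
        w, theta = params[:p], params[p:]
        z = X.dot(w)
        # pad with theta_{-1} = -inf and theta_{K-1} = +inf so that the
        # extremal classes use the same formula
        t = np.concatenate(([-np.inf], theta, [np.inf]))
        prob = sigmoid(t[y + 1] - z) - sigmoid(t[y] - z)
        return -np.log(np.clip(prob, 1e-12, 1.)).sum()

    x0 = np.concatenate((np.zeros(p), np.arange(K - 1, dtype=float)))
    # inequality constraints: consecutive thresholds must be non-decreasing
    params = fmin_slsqp(loss, x0,
                        f_ieqcons=lambda params: np.diff(params[p:]),
                        iprint=0)
    return params[:p], params[p:]

def ordinal_logistic_predict(w, theta, X):
    # assign class j when X w falls in the interval [theta_{j-1}, theta_j)
    return np.searchsorted(theta, X.dot(w))
```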

Code
can be found here as part of the minirank package, which
is my sandbox for code related to ranking and ordinal regression. At
some point I would like to submit it to scikit-learn, but right now
I don't know how the code will scale to medium-scale problems, though I
suspect not great. On top of that, I'm not sure there is real demand
for these models in scikit-learn and I don't want to bloat the package
with unused features.

Performance

I compared the prediction accuracy of this model, in the sense of mean absolute
error (IPython
notebook), on the Boston
house-prices dataset. To have an ordinal variable, I
rounded the values to the closest integer, which gave me a problem of
size 506 $\times$ 13 with 46 different target values. Although not a
huge increase in accuracy, this model did give me better results on
this particular dataset:

Here, ordinal logistic regression is the best-performing model,
followed by a Linear Regression model and a One-versus-All Logistic
regression model as implemented in scikit-learn.

"Regression models for ordinal data", P. McCullagh, Journal of
the Royal Statistical Society, Series B (Methodological), 1980

"Generalized Linear Models", P. McCullagh and J. A. Nelder (Book)

"Loss Functions for Preference Levels: Regression with Discrete
Ordered Labels", Jason D. M. Rennie, Nathan Srebro

My latest contribution for scikit-learn is
an implementation of the isotonic regression model that I coded with
Nelle Varoquaux and
Alexandre Gramfort. This model
finds the best least squares fit to a set of points, given the
constraint that the fit must be a non-decreasing
function. The example
on the scikit-learn website gives an intuition on this model.

The original points are in red, and the estimated ones are in
green. As you can see, there is one estimation (green point) for each
data sample (red point). Calling $y \in \mathbb{R}^n$ the input data,
the model can be written concisely as an optimization problem over $x$

$$
\text{argmin}_x \|y - x \|^2 \\
\text{subject to } x_1 \leq x_2 \leq \cdots \leq x_n
$$

The algorithm implemented in scikit-learn ^{3} is the pool adjacent
violators algorithm ^{1}, an efficient linear-time
$\mathcal{O}(n)$ algorithm. The algorithm sweeps through the data
looking for violations of the monotonicity constraint. When it finds
one, it adjusts the estimate to the best possible fit given the
constraints. Sometimes it also needs to modify previous points to make
sure the new estimate does not violate the constraints. The following
picture shows how it proceeds at each iteration:
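A bare-bones version of the sweep just described (an illustrative sketch, not the scikit-learn implementation) can be written as:

```python
import numpy as np

def pava(y):
    # pool adjacent violators: sweep through y keeping a stack of blocks
    # (mean, size); merge adjacent blocks whenever their means violate
    # the monotonicity constraint
    means, sizes = [], []
    for v in np.asarray(y, dtype=float):
        means.append(v)
        sizes.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, s2 = means.pop(), sizes.pop()
            m1, s1 = means.pop(), sizes.pop()
            means.append((m1 * s1 + m2 * s2) / (s1 + s2))
            sizes.append(s1 + s2)
    # expand each block back to one value per input point
    return np.concatenate([np.full(s, m) for m, s in zip(means, sizes)])
```

For example, `pava([1, 3, 2])` pools the last two points into their mean, 2.5.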

"Active set algorithms for isotonic regression: A unifying
framework", Michael J. Best, Nilotpal Chakravarti

Householder matrices are square matrices of the form

$$ P = I - \beta v v^T$$

where $\beta$ is a scalar and $v$ is a vector. Householder matrices have the useful
property that for suitably chosen $v$ and $\beta$, the product
$P x$ zeroes out all of the coordinates of $x$ but one, that is, $P x =
\|x\| e_1$. The following code, given $x$, finds the values of $\beta,
v$ that verify that property. The algorithm can be found in several
textbooks ^{1}

import numpy as np
from scipy import linalg

def house(x):
    """
    Given a vector x, computes the vector v, with v[0] = 1,
    and the scalar beta such that P = I - beta v v^T is
    orthogonal and P x = ||x|| e_1

    Parameters
    ----------
    x : array, shape (n,) or (n, 1)

    Returns
    -------
    beta : scalar
    v : array, shape (n, 1)
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, np.newaxis]
    sigma = linalg.norm(x[1:, 0]) ** 2
    v = np.vstack((1, x[1:]))
    if sigma == 0:
        beta = 0
    else:
        mu = np.sqrt(x[0, 0] ** 2 + sigma)
        if x[0, 0] <= 0:
            v[0, 0] = x[0, 0] - mu
        else:
            v[0, 0] = -sigma / (x[0, 0] + mu)
        beta = 2 * (v[0, 0] ** 2) / (sigma + v[0, 0] ** 2)
        v /= v[0, 0]
    return beta, v

As promised, this computes the parameters of $P$ such that $P x = \|x\| e_1$,
exact to 15 decimals:

This property is what makes Householder matrices useful in
the context of numerical analysis. They can be used, for example, to
compute the QR decomposition of a given matrix. The idea is to
successively zero out the sub-diagonal elements, thus leaving a
triangular matrix at the end. In the first iteration we compute a
Householder matrix $P_0$ such that $P_0 X$ has only zeros below the
diagonal of the first column, then we compute a Householder matrix $P_1$
such that $P_1 X$ zeroes out the subdiagonal elements of the second
column, and so on. At the end we will have that $P_n \cdots P_1 P_0 X$ is
an upper triangular matrix. Since all $P_i$ are orthogonal and symmetric, the
product $P_0 P_1 \cdots P_n$ is again an orthogonal matrix, namely the
$Q$ matrix in the QR decomposition.
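This procedure can be sketched in a few lines (an illustrative implementation under the usual sign convention, not the one from a linear algebra library; I accumulate $Q$ explicitly, which production code would avoid):

```python
import numpy as np

def householder_qr(A):
    # QR decomposition by successive Householder reflections
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.eye(m)
    R = A.copy()
    for k in range(min(m, n)):
        x = R[k:, k]
        norm_x = np.linalg.norm(x)
        if norm_x == 0:
            continue   # column already zero below the diagonal
        # reflection vector v = x + sign(x_0) ||x|| e_1 (the stable choice)
        v = x.copy()
        v[0] += norm_x if x[0] >= 0 else -norm_x
        v /= np.linalg.norm(v)
        # apply H = I - 2 v v^T to the trailing block of R and
        # accumulate the product of reflections into Q
        R[k:, k:] -= 2. * np.outer(v, v.dot(R[k:, k:]))
        Q[:, k:] -= 2. * np.outer(Q[:, k:].dot(v), v)
    return Q, R
```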

If we choose X as 20-by-20 random matrix, with colors representing
different values

we can see the process of the Householder matrices being applied one
by one to obtain an upper triangular matrix

A similar application of Householder matrices is to reduce a given
symmetric matrix to tridiagonal form, which proceeds in a similar way
as in the QR algorithm, only that now we multiply the matrix $X$ on
the left and on the right by the Householder matrices. Also, in this case
we seek Householder matrices that zero out the elements below the
first subdiagonal, instead of all elements below the diagonal. This
algorithm is used, for example, as a preprocessing step for most dense
eigensolvers.

"Matrix Computations" third edition, Golub & VanLoan (Algorithm 5.1.1). ↩

Note: this post contains a fair amount of LaTeX; if the math doesn't
render correctly, read it at its original location.

In machine learning it is common to formulate the classification task
as a minimization problem over a given loss function. Given input
data $(x_1, ..., x_n)$ and associated labels $(y_1, ..., y_n), y_i \in
\lbrace-1, 1\rbrace$, the problem becomes to find a function $f(x)$
that minimizes

$$L(x, y) = \sum_i ^n \text{loss}(f(x_i), y_i)$$

where loss is any loss function. These are usually functions that
become close to zero when $f(x_i)$ agrees in sign with $y_i$ and take
a positive value when $f(x_i)$ and $y_i$ have opposite signs. Common choices
of loss functions are:

Zero-one loss, $I(f(x_i) \neq y_i)$, where $I$ is the indicator function.

Hinge loss, $\text{max}(0, 1 - f(x_i) y_i)$

Logistic loss, $\log(1 + \exp(-f(x_i) y_i))$

^{1}

In the paper Loss functions for preference levels: Regression with
discrete ordered
labels,
the above setting, commonly used for classification and
regression, is extended to the ordinal regression problem. In
ordinal regression, classes can take one of several discrete, but
ordered, labels. Think for example of movie ratings that go from zero
to ten stars. Here there's an inherent order in the sense that
(unlike in the multiclass classification setting) not all errors are
equally bad. For instance, it is worse to mistake a 1-star movie for a
10-star one than a 4-star movie for a 5-star one.

To extend the binary loss to the case of ordinal regression, the
author introduces $K-1$ thresholds $\theta_0 \lt \theta_1 \lt
... \lt \theta_{K-2}$. These delimit $K$ segments, each corresponding to one of
the $K$ labels, and a predicted value between $\theta_{d-1}$ and $\theta_{d}$
corresponds to the prediction of class $d$ (supposing classes go
from zero to $K-1$). This generalizes the binary case, in which
zero is the unique threshold.

Suppose $K=7$ and that the correct label $y_i$ is 2, that is,
$f(x_i)$ must lie between $\theta_1$ and $\theta_2$. In that case, it
must be verified that $f(x_i) > \theta_0$ and $f(x_i) > \theta_1$ as
well as $f(x_i) < \theta_2 < \theta_3 < \theta_4 < \theta_5$. Treating this as
$K-1=6$ independent classification problems produces the following
family of loss functions (for a hinge loss):

The idea behind Rennie's paper is to sum all these loss functions to
produce a single optimization problem whose objective is the sum of all
threshold violations. Here, not only does the loss increase when
thresholds are violated, but the slope of the loss function increases
each time a threshold is crossed.
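The resulting all-threshold loss can be sketched as follows (an illustrative version for the hinge loss; the names and the indexing convention, class $y$ between `theta[y-1]` and `theta[y]`, are mine):

```python
def all_threshold_hinge(f, y, theta):
    # sum of hinge violations over the K-1 thresholds; theta is the
    # sorted list of finite thresholds and a prediction for class y
    # should fall between theta[y - 1] and theta[y]
    loss = 0.
    for j, th in enumerate(theta):
        if j < y:
            # f should be above threshold j, with a margin of one
            loss += max(0., 1. - (f - th))
        else:
            # f should be below threshold j, with a margin of one
            loss += max(0., 1. + (f - th))
    return loss
```

Each threshold crossed adds another active hinge term, which is exactly why the slope grows after every violated threshold.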

Code for generating all figures can be found here

Besides performing a line-by-line analysis of memory consumption,
memory_profiler
exposes some functions that allow you to retrieve the memory consumption
of a function in real-time, making it possible, for example, to visualize
the memory consumption of a given function over time.

The function to be used is memory_usage. Its first argument
specifies what code is to be monitored. This can be either an
external process or a Python function. In the case of an external
process the first argument is an integer representing its process
identifier (PID). In the case of a Python function, we need to pass the
function and its arguments to memory_usage. We do this by passing the
tuple (f, args, kw) that specifies the function, its positional
arguments as a tuple and its keyword arguments as a dictionary,
respectively. This will then be executed by memory_usage as
f(*args, **kw).

Let's see this with an example. Take as our function NumPy's
pseudo-inverse function. Thus
f = numpy.linalg.pinv and f takes one positional argument (the
matrix to be inverted), so args = (a,) where a is the matrix to be
inverted. Note that args must be a tuple consisting of the different
arguments, hence the parentheses around a. The third item is a
dictionary kw specifying the keyword arguments; here kw is optional
and is omitted.

>>> from memory_profiler import memory_usage
>>> import numpy as np
>>> # create a random matrix
>>> a = np.random.randn(500, 500)
>>> mem_usage = memory_usage((np.linalg.pinv, (a,)), interval=.01)
>>> print(mem_usage)
[57.02734375, 55.0234375, 57.078125, ...]

This gives a list of memory measurements taken at regular time
intervals ($t_0$, $t_0 + 0.01$, $t_0 + 0.02$, ...). Now I can
use that to, for example, plot the memory consumption as a function of
time:

>>> import pylab as pl
>>> pl.plot(np.arange(len(mem_usage)) * .01, mem_usage, label='linalg.pinv')
>>> pl.xlabel('Time (in seconds)')
>>> pl.ylabel('Memory consumption (in MB)')
>>> pl.show()

This will give the memory usage of a single function across time, which
might be interesting, for example, to detect temporaries that are
created during the execution.

Another use case for memory_usage would be to see how memory behaves
as input data gets bigger. In this case we are interested in memory as
a function of the input data. One obvious way to do this is to call
the same function each time with a different input and take as
memory consumption the maximum consumption over time. This way we will
have one memory usage measurement for each input.

It is now possible to plot these results as a function of the
dimensions.

import numpy as np
import pylab as pl
from memory_profiler import memory_usage

dims = np.linspace(100, 1000, 10)
pinv_mem = np.zeros(dims.size)

for i_dim, k in enumerate(dims):
    x = np.random.randn(int(k), int(k))
    tmp = memory_usage((np.linalg.pinv, (x,)), interval=.01)
    pinv_mem[i_dim] = np.max(tmp)

pl.plot(dims, pinv_mem, label='np.linalg.pinv')
pl.ylabel('Memory (in MB)')
pl.xlabel('Dimension of the square matrix')
pl.legend(loc='upper left')
pl.axis('tight')
pl.show()