# fa.bianp.net

Most proofs in optimization consist in using inequalities for a particular function class in some creative way. This is a cheatsheet with inequalities that I use most often. It considers class of functions that are convex, strongly convex and $L$-smooth.

Setting. $f$ is a function $\mathbb{R}^p \to \mathbb{R}$. Below are a set of inequalities that are verified when $f$ belongs to a particular class of functions and $x, y \in \mathbb{R}^p$ are arbitrary elements in its domain. For simplicity I'm assuming that functions are differentiable, but most of these are also true replacing the gradient with a subgradient.

$f$ is $L$-smooth. This is the class of functions that are differentiable and its gradient is Lipschitz continuous.

• $\|\nabla f(y) - \nabla f(x) \| \leq L \|x - y\|$
• $|f(y) - f(x) - \langle \nabla f(x), y - x\rangle| \leq \frac{L}{2}\|y - x\|^2$
• $\nabla^2 f(x) \preceq L\qquad \text{ (assuming$f$is twice differentiable)}$

$f$ is convex:

• $f(\lambda x + (1 - \lambda)y) \leq \lambda f(x) + (1 - \lambda)f(y)$ for all $\lambda \in [0, 1]$.
• $f(x) \leq f(y) + \langle \nabla f(x), x - y\rangle$
• $0 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$
• $f(\mathbb{E}X) \leq \mathbb{E}[f(X)]$ where $X$ is a random variable (Jensen's inequality).
• $x = \text{prox}_{\gamma f}(x) + \gamma \text{prox}_{f^*/\gamma}(x/\gamma)$, where $f^*$ is the Fenchel conjugate (Moreau's decomposition, )

$f$ is both $L$-smooth and convex:

• $\frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$
• $0 \leq f(y) - f(x) - \langle \nabla f(x), y - x\rangle \leq \frac{L}{2}\|x - y\|^2$
• $f(x) \leq f(y) + \langle \nabla f(x), x - y\rangle - \frac{1}{2 L}\|\nabla f(x) - \nabla f(y)\|^2$

$f$ is $\mu$-strongly convex. Set of functions $f$ such that $f - \frac{\mu}{2}\|\cdot\|^2$ is convex. It includes the set of convex functions with $\mu=0$. Here $x^*$ denotes the minimizer of $f$.

• $f(x) \leq f(y) + \langle \nabla f(x), x - y \rangle - \frac{\mu}{2}\|x - y\|^2$
• $\mu\|x - y\|^2 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$
• $\frac{\mu}{2}\|x-x^*\|^2\leq f(x) - f(x^*)$

$f$ is both $L$-smooth and $\mu$-strongly convex.

• $\frac{\mu L}{\mu + L}\|x - y\|^2 + \frac{1}{\mu + L}\|\nabla f(x) - \nabla f(y)\|^2 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$
• $\mu \preceq \nabla^2 f(x) \preceq L \qquad \text{ (assuming$f$is twice differentiable)}$

### References

Most of these inequalities appear in the Book: "Introductory lectures on convex optimization: A basic course" by Nesterov (2013, Springer Science & Business Media). Another good (and free) resource is the book "Convex Optimization" by Stephen Boyd and Lieven Vandenberghe.

My friend Rémi Leblond has recently uploaded to ArXiv our preprint on an asynchronous version of the SAGA optimization algorithm.

The main contribution is to develop a parallel (fully asynchronous, no locks) variant of the SAGA algorighm. This is a stochastic variance-reduced method for general optimization, specially adapted for problems that arise frequently in machine learning such as (regularized) least squares and logistic regression. Besides the specification of the algorithm, we also provide a convergence proof and convergence rates. Furthermore, we fix some subtle technical issues present in previous literature (proving things in the asynchronous setting is hard!).

The core of the asynchronous algorithm is similar to Hogwild!, a popular asynchronous variant of stochastc gradient descent (SGD). The main difference is that instead of using SGD as a building block, we use SAGA. This has many advantages (and poses some challenges): faster (exponential!) rates of convergence and convergence to arbitrary precision with a fixed step size (hence clear stopping criterion), to name a few.

The speedups obtained versus the sequential version are quite impressive. For example, we have observed to commonly obtain 5x-7x speedups using 10 cores:

Update (April 2017): this work has been presented at AISTATS 2017.

TL;DR: I describe a method for hyperparameter optimization by gradient descent.

Most machine learning models rely on at least one hyperparameter to control for model complexity. For example, logistic regression commonly relies on a regularization parameter that controls the amount of $\ell_2$ regularization. Similarly, kernel methods also have hyperparameters that control for properties of the kernel, such as the "width" parameter in the RBF kernel. The fundamental distinction between model parameters and hyperparameters is that, while model parameters are estimated by minimizing a goodness of fit with the training data, hyperparameters need to be estimated by other means (such as a cross-validation loss), as otherwise models with excessive would be selected, a phenomenon known as overfitting. We can use an approximate gradient to optimize a cross-validation loss with respect to hyperparameters. A decreasing bound between the true gradient and the approximate gradient ensures that the method converges towards a local minima.

Fitting hyperparameters is essential to obtain models with good accuracy and computationally challenging. The existing most popular methods for fitting hyperparameters are based on either exhaustively exploring the whole hyperparameter space (grid search and random search) or on Bayesian optimization techniques that use previous function evaluations to guide the optimization procedure. The starting point of this work was a simple question: why are the procedures to estimate parameters and hyperparameters so different? Is it possible to use known and reliable methods such as gradient descent to fit not only parameters, but also hyperparameters?

Interestingly, I found out that this question had been answered a long time ago. Already in the 90s, Larsen et al. devised a method (described here and here) using gradient-descent to estimate the optimal value of $\ell_2$ regularization for neural networks. Shortly after, Y. Bengio also published a paper on this topic. Recently, there has been a renewed interest in gradient-based methods (see for example this paper by Maclaurin or a slightly earlier work by Justin Domke, and references therein).

One of the drawbacks of gradient-based optimization of hyperparameters, is that these depend on quantities that are costly to compute such as the exact value of the model parameters and the inverse of a Hessian matrix. The aim of this work is to relax some of these assumptions and provide a method that works when the quantities involved (such as model parameters) are known only approximately. In practice, what this means is that hyperparameters can be updated before model parameters have fully converged, which results in big computational gains. For more details and experiments, please take a look at the paper.

This paper was presented at the International Conference on Machine Learning (ICML 2016).Code is now available in github and these are the slides I used for the occasion:

### Reviews

The original ICML reviews for this paper can be seen here and the rebuttal here. These were high quality reviews that had some rightful concenrs with the first version of the manuscript. In fact, 2 out of 3 reviewers gave a "weak reject" in this first phase. Luckily these concerns could be contested in the rebuttal (the final manuscript was updated accordingly) and the 2 reviewers that gave a "weak reject" changed their rating to "weak accept" and the paper was finally accepted.

### (Relatively) Frequently Asked Questions

• The outer loss function does not depend directly on the regularization parameter $\lambda$. Why is it there?

When the outer loss is a cross-validation loss this is true, but other criteria might depend on this parameter, such as the SURE criteria (see e.g. Deladalle et al. 2014).

### Citing

Please cite this work if the paper or its associated code are relevant for you. You can use the following bibtex:

  @inproceedings{PedregosaHyperparameter16,
author    = {Fabian Pedregosa},
title     = {Hyperparameter optimization with approximate gradient},
booktitle = {Proceedings of the 33nd International Conference on Machine Learning ({ICML})},
year      = {2016},
url       = {http://jmlr.org/proceedings/papers/v48/pedregosa16.html},
}


### Erratum

An early version of the paper contained the following typo: the first equation in Theorem 2 should read $\sum_{i=1}^\infty \varepsilon_i < \infty$ instead of $\sum_{i=1}^\infty \varepsilon_i \leq \infty$ (note the $<$ in the first equation versus $\leq$ in the second one). This typo has been corrected in both ArXiv version and the proceedings.

Announce: first public release of lightning!, a library for large-scale linear classification, regression and ranking in Python. The library was started a couple of years ago by Mathieu Blondel who also contributed the vast majority of source code. I joined recently its development and decided it was about time for a v0.1!.

Prebuild conda packages are available for all operating systems (god thank appveyor). More information on lightning's website.

Together with other scikit-learn developers we've created an umbrella organization for scikit-learn-related projects named scikit-learn-contrib. The idea is for this organization to host projects that are deemed too specific or too experimental to be included in the scikit-learn codebase but still offer an API which is compatible with scikit-learn and would like to benefit of the visibility of being labeled as scikit-learn-compatible.

We've set two requirements to being under this umbrella right now (this might evolve in the future). The first requirement is to have a scikit-learn compatible API, i.e., to follow the guide on the scikit-learn documentation so that objects can be used by scikit-learn meta-estimators (such as GridSearchCV). The second condition is that projects should be actively maintaned and have a high-quality codebase. Judging the quality of a codebase is difficult and subjective, but we agreed that at the bare minimum, the source code should be tested using continuous integration tools such as travis and reach a good test coverage (above 80%). More information is not available on the scikit-learn-contrib repository.

The first project to be hosted by this organization is lightning, but we hope that others will follow. If you would like to submit a new project, open an issue at the main project and we will look into it. There is also a project template for new and old projects.

Recently I've implemented, together with Arnaud Rachez, the SAGA[1] algorithm in the lightning machine learning library (which by the way, has been recently moved to the new scikit-learn-contrib project). The lightning library uses the same API as scikit-learn but is particularly adapted to online learning. As for the SAGA algorithm, its performance is similar to other variance-reduced stochastic algorithms such as SAG[3] or SVRG[2] but it has the advantage with respect to SAG[3] that it allows non-smooth penalty terms (such as $\ell_1$ regularization). It is implemented in lightning as SAGAClassifier and SAGARegressor.

We have taken care to make this implementation as efficient as possible. As for most stochastic gradient algorithms, a naive implementation takes 3 lines of code and is straightforward to implement. However, there are many tricks that are time-consuming and error-prone to implement but make a huge difference in efficiency.

A small example, more as a sanity check than to claim anything. The following plot shows the suboptimality as a function of time for three similar methods: SAG, SAGA and SVRG. The dataset used in the RCV1 dataset (test set, obtained from the libsvm webpage), consisting of 677.399 samples and 47.236 features. Interestingly, all methods can solve this rather large-scale problem within a few seconds. Within them, SAG and SAGA have a very similar performance and SVRG seems to be reasonably faster.

A note about the benchmarks: it is difficult to compare fairly stochastic gradient methods because at the end it usually boils down to how you choose the step size. In this plot I set the step size of all methods to 1/(3L) , where L is the Lipschitz constant of the objective function, as I think this is a popular choice. I would have prefered 1/L but SVRG was not converging for this step size. The code for the benchmarks can be found here.

1. A. Defazio, F. Bach & S. Lacoste-Julien. "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives" (2014).

2. Rie Johnson and Tong Zhang. "Accelerating stochastic gradient descent using predictive variance reduction." Advances in Neural Information Processing Systems. 2013.

3. Mark Schmidt, Nicolas Le Roux, and Francis Bach. "Minimizing finite sums with the stochastic average gradient." arXiv preprint arXiv:1309.2388 (2013).

My latests work (with Francis Bach and Alexandre Gramfort) is on the consistency of ordinal regression methods. It has the wildly imaginative title of "On the Consistency of Ordinal Regression Methods" and is currently under review but you can read the draft of it on ArXiv. If you have any thoughts about it, please leave me a comment!

Update July 2017: this paper was published on the Journal of Machine Learning Research. The published version can be found here

### Ordinal what?

The problem of ordinal regression is an old one in supervised learning. Its roots can be traced back to the works of McCullagh[1] in the 80s. It is a supervised learning problem that shares properties --yet is fundamentally different-- with both multiclass classification and regression. It can be seen as the problem of predicting a target variable from labeled observations, where the target label consists of discrete and ordered labels. As in the multiclass classification setting, the target variables are of discrete nature, and as in the regression setting (but unlike the multiclass setting) there is a meaningful order between the classes.

The most popular example of ordinal regression arise when the target variable is a human generated rating. For example, for a movie recommendation system, the target variable can have the possible values “do-not-bother” ≺ “only-if-you-must” ≺ “good” ≺ “verygood” ≺ “run-to-see”. Using multiclass classification to predict this target would yield a suboptimal classifier since it ignores the fact that there is a natural ordering between the labels. On the other hand, a regression algorithm assumes that the target variable is continuous, while here it is clearly discrete. Ordinal regression would be the ideal model for this target variable.

### Fisher consistency

The notion of Fisher consistency is also an old one in statistics, and goes back to the work of Fisher at the beginning of the 20th century. The rigorous definition is stated in the paper, so here I'll just give an intuition.

In supervised learning, we observe random samples (in the form of pairs (target, sample) usually denoted $(y_i, X_i)$ ) from a population (lets call it P) and build a model that predicts the target when he seems a new sample. Fisher consistency can be seen as a sanity check on the learning model, that states that if instead of seeing a random sample we would have access to the full population P (which in real life never happens), then our classifier would have an accuracy as good as the best possible accuracy (such classifier is usually called Bayes rule or Bayes predictor).

Having Fisher consistency is an important property that "allows us to design good loss functions with desirable properties"[7]. Because of this, in the last decade the Fisher consistency of most used supervised learning methods has been investigated. It has been shown (see e.g.[2]) that most common methods for binary classification are consistent. For the multiclass case and ranking, the situation is more interesting, with some methods that are known to be inconsistent, such as one-vs-all SVM in multiclass classification and RankSVM.

The study of Fisher consistency for ordinal regression methods has been done for the first time (to the best of my knowledge) here and proves that despite the negative results of multiclass classification[3][4] and ranking[5][6], common ordinal regression methods are Fisher consistent. This brings ordinal regression closer to binary classification than to multiclass classification in this respect. And in fact, some results in the paper can be seen as generalization of known results for binary classification.

### Highlights

In the paper we study the Fisher consistency of some popular ordinal regression methods. The methods that we analyze are the following (see Table 1 in the paper of a definition): all threshold, cumulative link, immediate threshold and last absolute deviation. In the paper we present the following results

• We characterize the consistency of all threshold method and immediate threshold by the derivative at zero of an auxiliary function.
• We provide an excess risk bound of the all threshold loss.
• Prove consistency of the least absolute deviation. This was already done for the case of three classes by Ramaswamy et al.[8], here we extend the proof to an arbitrary number of classes.
• Prove consistency of cumulative link model (a model that includes the venerable proportional odds model of McCullagh1).
• The consistency analysis suggest a novel loss function when optimizing a least squares metric. We test this novel model on different datasets and report that it performs very competitively.

1. McCullagh, Peter. "Regression models for ordinal data." Journal of the royal statistical society. Series B (Methodological) (1980).

2. Bartlett, Peter L., Michael I. Jordan, and Jon D. McAuliffe. "Convexity, classification, and risk bounds." Journal of the American Statistical Association (2006).

3. Tewari, Ambuj, and Peter L. Bartlett. "On the consistency of multiclass classification methods." The Journal of Machine Learning Research 8 (2007).

4. Zhang, Tong. "Statistical analysis of some multi-category large margin classification methods." The Journal of Machine Learning Research 5 (2004).

5. Duchi, John C., Lester W. Mackey, and Michael I. Jordan. "On the consistency of ranking algorithms." Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010.

6. Calauzenes, Clément, Nicolas Usunier, and Patrick Gallinari. "On the (non-) existence of convex, calibrated surrogate losses for ranking." Neural Information Processing Systems. 2012.

7. Zhang, Tong. "Statistical behavior and consistency of classification methods based on convex risk minimization." Annals of Statistics (2004)

8. Ramaswamy, Harish G., and Shivani Agarwal. "Classification calibration dimension for general multiclass losses." Advances in Neural Information Processing Systems. 2012.

Cross-validation iterators in scikit-learn are simply generator objects, that is, Python objects that implement the __iter__ method and that for each call to this method return (or more precisely, yield) the indices or a boolean mask for the train and test set. Hence, implementing new cross-validation iterators that behave as the ones in scikit-learn is easy with this in mind. Here goes a small code snippet that implements a holdout cross-validator generator following the scikit-learn API.

import numpy as np
from sklearn.utils import check_random_state

class HoldOut:
"""
Hold-out cross-validator generator. In the hold-out, the
data is split only once into a train set and a test set.
Unlike in other cross-validation schemes, the hold-out
consists of only one iteration.

Parameters
----------
n : total number of samples
test_size : 0 < float < 1
Fraction of samples to use as test set. Must be a
number between 0 and 1.
random_state : int
Seed for the random number generator.
"""
def __init__(self, n, test_size=0.2, random_state=0):
self.n = n
self.test_size = test_size
self.random_state = random_state

def __iter__(self):
n_test = int(np.ceil(self.test_size * self.n))
n_train = self.n - n_test
rng = check_random_state(self.random_state)
permutation = rng.permutation(self.n)
ind_test = permutation[:n_test]
ind_train = permutation[n_test:n_test + n_train]
yield ind_train, ind_test


Contrary to other cross-validation schemes, holdout relies on a single split of the data. It is well known than in practice holdout performs much worse than KFold or LeaveOneOut schemes. However, holdout has the advantage that its theoretical properties are easier to derive. For examples of this see e.g. Section 8.7 of Theory of classification: a survey of some recent advances and the very recent The reusable holdout.

TL;DR I created a gallery for IPython/Jupyter notebooks. Check it out :-)

A couple of months ago I put online a website that displays a collection of IPython/Jupyter notebooks. The is a website that collects user-submitted and publicly available notebooks and displays them with a nice screenshot. The great thing about this website compared to other similar efforts is that this one updates and categorizes the notebooks (by date and views) automatically. You can even search for anything in this database that contains already more than 400 notebooks!

## Vision

I would like to make a website where it is possible to

1. Find the notebook you are looking for. There is precious information (examples, documentation, tutorials, etc.) contained within IPython/Jupyter notebooks. It should be easy to find this information.

2. Discover new notebooks. I would like to see new notebooks as they are submitted in order to discover new notebooks, possibly about new technologies that I did not know about.

## How it works

Under the hood there's a django app, for which the source code lives here (don't hesitate to use the issues feature in github to suggest features or to report bugs). To propose new notebooks there's a tab where anyone can leave the URL of a notebook. If all goes right, the django app will take a screenshot will be taken and incorporated into the database. The only non-trivial part of this process is to take the screenshot, for which I use a bit of javascript around phantomjs.

## Future plans

What you see is just the beginning!. As time permits I would like to implement user authentication so that registered users can bookmark their favorite notebooks, up and down-vote notebooks etc. Cathegorization of notebooks (e.g. in bins such as Math, Phisics, Machine Learning, R, Python, Julia, etc.) is also high on my list. Leave me a comment if you would like to see some specific feature!

Last Friday was PyData Paris, in words of the organizers, ''a gathering of users and developers of data analysis tools in Python''.

The organizers did a great job in putting together and the event started already with a full room for Gael's keynote

My take-away message from the talks is that Python has grown in 5 years from a language marginally used in some research environments into one of the main languages for data science used both in research labs and industrial environment.

My personal highlights were (note that there were two parallel tracks)

• Ian Ozsvald's talk on Cleaning Confused Collections of Characters. Ian gave a very practical talk, full of real world examples. The slides have already been uploaded on his website. Many tips and many pointers to libraries. In particular, I discovered fixes text for you.

• Chloe-Agathe gave a short talk on DREAM challenges. In her talk she mentioned GPy. One year ago, I visited Neil Lawrence at his lab in Sheffield and at that point they were in the process of migrating their Matlab codebase into Python (the GPy project). I'm very glad to see that the project is succeeding and being used by other research institutions.

• Serge Guelton and Pierrick Brunet presented “Pythran: Static Compilation of Parallel Scientific Kernels”. From their own documentation: “Pythran is a python to c++ compiler for a subset of the python language. It takes a python module annotated with a few interface description and turns it into a native python module with the same interface, but (hopefully) faster”. The project seems promising although I do not have had experience as to judge the quality of their implementation.

• Antoine Pitrou presented: “Numba, a JIT compiler for fast numerical code”. I must say that I'm an avid user of Numba so of course I was looking forward to this talk. One thing I didn't know is that support for CUDA is being implemented into Numba via the @cuda.jit decorator. From their website it looks like this is only available in the Numba Pro version (not free).

• Kirill Smelkov presented wendelin.core, an approach to perform out-of-core computations with numpy. Slides can be found here.

• Finally, Frances Alted gave the final keynote on “New Trends In Storing And Analyzing Large Data Silos With Python”. Among the projects he mentioned, I found particularly interesting bcolz, his current main project and DyND, a Python wrapper around a multi-dimensional array library.