Keep the gradient flowing

On the Link Between Optimization and Polynomials, Part 6.

The Cost of Differentiating Through Optimization

Differentiating through optimization is a fundamental problem in hyperparameter optimization, dataset distillation, meta-learning and optimization as a layer, to name a few. In this blog post we'll look into one of the main approaches to differentiate through optimization: unrolled differentiation. With the help of polynomials, we'll be able to derive …

Optimization Nuggets: Stochastic Polyak Step-size, Part 2

Faster rates under strong convexity

This blog post discusses the convergence rate of the Stochastic Gradient Descent with Stochastic Polyak Step-size (SGD-SPS) algorithm for minimizing a finite sum objective. Building upon the proof of the previous post, we show that the convergence rate can be improved to O(1/t) under the additional assumption that …

Optimization Nuggets: Stochastic Polyak Step-size

A simple step-size tuner with optimal rates

The stochastic Polyak step-size (SPS) is a practical variant of the Polyak step-size for stochastic optimization. In this blog post, we'll discuss the algorithm and provide a simple analysis for convex objectives with bounded gradients.

Notes on the Frank-Wolfe Algorithm, Part III: backtracking line-search

Backtracking step-size strategies (also known as adaptive step-size or approximate line-search) that set the step-size based on a sufficient decrease condition are the standard way to set the step-size on gradient descent and quasi-Newton methods. However, these techniques are much less common for Frank-Wolfe-like algorithms. In this blog post I …

On the Link Between Optimization and Polynomials, Part 5

Cyclical Step-sizes.


Six: All of this has happened before.
Baltar: But the question remains, does all of this have to happen again?
Six: This time I bet no.
Baltar: You know, I've never known you to play the optimist. Why the change of heart?
Six: Mathematics. Law of averages. Let a complex …

Optimization Nuggets: Exponential Convergence of SGD

This is the first of a series of blog posts on short and beautiful proofs in optimization (let me know what you think in the comments!). For this first post in the series I'll show that stochastic gradient descent (SGD) converges exponentially fast to a neighborhood of the solution.

On the Link Between Optimization and Polynomials, Part 4

Acceleration without Momentum.

While the most common accelerated methods like Polyak and Nesterov incorporate a momentum term, a little known fact is that simple gradient descent –no momentum– can achieve the same rate through only a well-chosen sequence of step-sizes. In this post we'll derive this method and through simulations discuss its practical …

On the Link Between Optimization and Polynomials, Part 3

A Hitchhiker's Guide to Momentum.

I've seen things you people wouldn't believe.
Valleys sculpted by trigonometric functions.
Rates on fire off the shoulder of divergence.
Beams glitter in the dark near the Polyak gate.
All those landscapes will be lost in time, like tears in rain.
Time to halt.

A momentum optimizer *

On the Link Between Optimization and Polynomials, Part 2

Momentum: when Chebyshev meets Chebyshev.

We can tighten the analysis of gradient descent with momentum through a cobination of Chebyshev polynomials of the first and second kind. Following this connection, we'll derive one of the most iconic methods in optimization: Polyak momentum.

On the Link Between Polynomials and Optimization, Part 1

Residual Polynomials and the Chebyshev method.

There's a fascinating link between minimization of quadratic functions and polynomials. A link that goes deep and allows to phrase optimization problems in the language of polynomials and vice versa. Using this connection, we can tap into centuries of research in the theory of polynomials and shed new light on …

Notes on the Frank-Wolfe Algorithm, Part II: A Primal-dual Analysis

This blog post extends the convergence theory from the first part of these notes on the Frank-Wolfe (FW) algorithm with convergence guarantees on the primal-dual gap which generalize and strengthen the convergence guarantees obtained in the first part.

Three Operator Splitting

I discuss a recently proposed optimization algorithm: the Davis-Yin three operator splitting.

Notes on the Frank-Wolfe Algorithm, Part I

This blog post is the first in a series discussing different theoretical and practical aspects of the Frank-Wolfe algorithm.

$$ \def\xx{\boldsymbol x} \def\yy{\boldsymbol y} \def\ss{\boldsymbol s} \def\dd …

Optimization inequalities cheatsheet

Most proofs in optimization consist in using inequalities for a particular function class in some creative way. This is a cheatsheet with inequalities that I use most often. It considers class of functions that are convex, strongly convex and $L$-smooth.

A fully asynchronous variant of the SAGA algorithm

My friend Rémi Leblond has recently uploaded to ArXiv our preprint on an asynchronous version of the SAGA optimization algorithm.

The main contribution is to develop a parallel (fully asynchronous, no locks) variant of the SAGA algorighm. This is a stochastic variance-reduced method for general optimization, specially adapted for problems …

Hyperparameter optimization with approximate gradient

TL;DR: I describe a method for hyperparameter optimization by gradient descent.

Most machine …