When an optimization problem has multiple global minima, different algorithms can find different solutions, a phenomenon often referred to as the implicit bias of optimization algorithms. In this post we'll characterize the implicit bias of gradient-based methods on a class of regression problems that includes linear least squares and Huber …

This is the first of a series of blog posts on short and beautiful proofs in optimization (let me know what you think in the comments!). For this first post in the series I'll show that stochastic gradient descent (SGD) converges exponentially fast to a neighborhood of the solution.

While the most common accelerated methods like Polyak and Nesterov incorporate a momentum term, a little-known fact is that simple gradient descent (no momentum) can achieve the same rate through only a well-chosen sequence of step-sizes. In this post we'll derive this method and through simulations discuss its practical …

I've seen things you people wouldn't believe.
Valleys sculpted by trigonometric functions.
Rates on fire off the shoulder of divergence.
Beams glitter in the dark near the Polyak gate.
All those landscapes will be lost in time, like tears in rain. Time to halt.

We can tighten the analysis of gradient descent with momentum through a combination of Chebyshev polynomials of the first and second kind. Following this connection, we'll derive one of the most iconic methods in optimization: Polyak momentum.
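
As a rough illustration of the method named above, here is a minimal sketch of Polyak momentum (heavy ball) on a small diagonal quadratic. The step-size and momentum values are the standard textbook choice for quadratics whose curvature lies in `[mu, L]`; the function names, the test problem and the iteration count are assumptions made for this example, not taken from the post.

```python
import math


def polyak_momentum(grad, x0, mu, L, n_iter):
    """Heavy-ball iteration x+ = x - step * grad(x) + momentum * (x - x_prev).

    Assumes a quadratic objective whose Hessian eigenvalues lie in [mu, L].
    """
    sqrt_mu, sqrt_L = math.sqrt(mu), math.sqrt(L)
    step = 4.0 / (sqrt_L + sqrt_mu) ** 2
    momentum = ((sqrt_L - sqrt_mu) / (sqrt_L + sqrt_mu)) ** 2
    x_prev = list(x0)
    x = list(x0)  # x_prev == x, so the first step is plain gradient descent
    for _ in range(n_iter):
        g = grad(x)
        x_next = [
            xi - step * gi + momentum * (xi - xpi)
            for xi, gi, xpi in zip(x, g, x_prev)
        ]
        x_prev, x = x, x_next
    return x


# Hypothetical test problem: f(x) = 0.5 * sum(h_i * x_i ** 2), minimized at 0.
curvatures = [1.0, 10.0, 100.0]


def quad_grad(x):
    return [h * xi for h, xi in zip(curvatures, x)]


x_final = polyak_momentum(quad_grad, [1.0, 1.0, 1.0], mu=1.0, L=100.0, n_iter=200)
```

With these parameters the iterates contract at roughly a factor `(sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu))` per step, so `x_final` should be very close to the minimizer at the origin.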

There's a fascinating link between minimization of quadratic functions and polynomials. A link that goes deep and allows us to phrase optimization problems in the language of polynomials and vice versa. Using this connection, we can tap into centuries of research in the theory of polynomials and shed new light on …

A naive implementation of the logistic regression loss can result in numerical indeterminacy even for moderate values. This post takes a closer look into the source of these instabilities and discusses more robust Python implementations.
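
To give a flavor of the issue, here is a minimal sketch comparing a naive evaluation of the logistic loss `log(1 + exp(-z))` with one common stable rewrite. The function names are hypothetical, and the stable form shown is a standard trick rather than necessarily the implementation the post settles on.

```python
import math


def naive_logistic_loss(z):
    # Direct transcription of log(1 + exp(-z)); exp overflows for very negative z.
    return math.log(1.0 + math.exp(-z))


def stable_logistic_loss(z):
    # Uses the identity log(1 + exp(-z)) = max(-z, 0) + log1p(exp(-|z|)),
    # so the argument passed to exp is never positive.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# The naive version fails on a large negative margin, the stable one does not.
try:
    naive_logistic_loss(-1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable_value = stable_logistic_loss(-1000.0)
```

The stable version only ever exponentiates a non-positive number, so it can underflow harmlessly to zero but cannot overflow.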

This blog post extends the convergence theory from the first part of these notes on the
Frank-Wolfe (FW) algorithm with convergence guarantees on the primal-dual gap, which generalize
and strengthen those obtained in the first part.