fa.bianp.net

On the Link Between Optimization and Polynomials, Part 3

A Hitchhiker's Guide to Momentum.

I've seen things you people wouldn't believe.
Valleys sculpted by trigonometric functions.
Rates on fire off the shoulder of divergence.
Beams glitter in the dark near the Polyak gate.
All those landscapes will be lost in time, like tears in rain.
Time to halt.

A momentum optimizer *

$$ \def\aa{\boldsymbol a} \def\rr{\boldsymbol r} \def\AA{\boldsymbol A} \def\HH{\boldsymbol H} \def\EE{\mathbb E} \def\II{\boldsymbol I} \def\CC{\boldsymbol C} \def\DD{\boldsymbol D} \def\KK{\boldsymbol K} \def\eeps{\boldsymbol \varepsilon} \def\tr{\text{tr}} \def\LLambda{\boldsymbol \Lambda} \def\bb{\boldsymbol b} \def\cc{\boldsymbol c} \def\xx{\boldsymbol x} \def\zz{\boldsymbol z} \def\uu{\boldsymbol u} \def\vv{\boldsymbol v} \def\qq{\boldsymbol q} \def\yy{\boldsymbol y} \def\ss{\boldsymbol s} \def\pp{\boldsymbol p} \def\lmax{L} \def\lmin{\ell} \def\RR{\mathbb{R}} \def\TT{\boldsymbol T} \def\QQ{\boldsymbol Q} \def\CC{\boldsymbol C} \def\Econd{\boldsymbol E} \DeclareMathOperator*{\argmin}{{arg\,min}} \DeclareMathOperator*{\argmax}{{arg\,max}} \DeclareMathOperator*{\minimize}{{minimize}} \DeclareMathOperator*{\dom}{\mathbf{dom}} \DeclareMathOperator*{\Fix}{\mathbf{Fix}} \DeclareMathOperator{\prox}{\mathbf{prox}} \DeclareMathOperator{\span}{\mathbf{span}} \DeclareMathOperator{\Ima}{Im} \def\defas{\stackrel{\text{def}}{=}} \def\dif{\mathop{}\!\mathrm{d}} \definecolor{colormomentum}{RGB}{27, 158, 119} \definecolor{colorstepsize}{RGB}{217, 95, 2} \def\mom{{\color{colormomentum}m}} \def\step{{\color{colorstepsize}h}} $$

Momentum

Gradient descent with momentum–momentum for short– This method is sometimes also referred to as Heavy Ball. are a class of optimization methods designed to solve unconstrained minimization problems of the form \begin{equation} \argmin_{\xx \in \RR^d} f(\xx)\,, \end{equation} where the objective function $f$ is differentiable and we have access to its gradient $\nabla f$. In this method the update is a sum of two terms. The first term is the difference between the current and the previous iterate $(\xx_{t} - \xx_{t-1})$, also known as momentum term. The second term is the gradient $\nabla f(\xx_t)$ of the objective function.

Gradient Descent with Momentum
Input: starting guess $\xx_0$, step-size $\step > 0$ and momentum parameter $\mom \in (0, 1)$.
$\xx_1 = \xx_0 - \dfrac{\step}{\mom+1} \nabla f(\xx_0)$
For $t=1, 2, \ldots$ compute \begin{equation}\label{eq:momentum_update} \xx_{t+1} = \xx_t + \mom(\xx_{t} - \xx_{t-1}) - \step\nabla f(\xx_t) \end{equation}

Despite its simplicity, momentum exhibits unexpectedly rich dynamics that we'll explore on this post. An excellent paper that explores the dynamics of momentum is Gabriel Goh's Why Momentum Really Works.

The landscape described in the section "The Dynamics of Momentum" are different than the ones described in this post. The ones in Gabriel's paper describe the improvement along the direction of a single eigenvector, and as such are at best a partial view of convergence (in fact, they give the misleading impression that the best convergence is achieved with zero momentum). The techniques used in this post allow instead to compute the global convergence rate, and figures like the above correctly predict a non-trivial optimal momentum term.

Convergence rates

In the last post we saw that, assuming $f$ is a quadratic objective of the form \begin{equation}\label{eq:opt} f(\xx) \defas \frac{1}{2}\xx^\top \HH \xx + \bb^\top \xx~, \end{equation} then we can analyze momentum using a convex combination of Chebyshev polynomials of the first and second kind.

More precisely, we showed that the error at iteration $t$ can be bounded as \begin{equation}\label{eq:theorem} \|\xx_t - \xx^\star\| \leq \underbrace{\max_{\lambda \in [\ell, L]} |P_t(\lambda)|}_{\defas r_t} \|\xx_0 - \xx^\star\|\,.\\ \end{equation} where $\ell$ and $L$ are the smallest and largest eigenvalues of $\HH$ respectively and $P_t$ is a $t$-th degree polynomial that we can write as \begin{align} P_t(\lambda) = \mom^{t/2} \left( {\small\frac{2\mom}{1+\mom}}\, T_t(\sigma(\lambda)) + {\small\frac{1 - \mom}{1 + \mom}}\,U_t(\sigma(\lambda))\right)\,. \end{align} where $\sigma(\lambda) = {\small\dfrac{1}{2\sqrt{\mom}}}(1 + \mom - \step\,\lambda)\,$ is a linear function that we'll refer to as the link function and $T_t$ and $U_t$ are the Chebyshev polynomials of the first and second kind respectively.

We'll refer to the maximum of this polynomial over the $[\ell, L]$ interval as the convergence rate and denote it $r_t$. For the class of problems in this post the convergence rate contains an exponential function of $t$ which dominates other terms for large $t$. For this reasons we will find it more convenient to focus instead on the rate factor, which is the limit superior as $t\to \infty$ of its $t$-th root, i.e., $\limsup_{t \to \infty} \sqrt[t]{r_t}$. The convergence rate factor provides a convenient way to compare the (asymptotic) speed of convergence, as it allows to ignore other (slower) terms that are negligible for large $t$.

This identity between momentum and Chebyshev polynomials opens the door to deriving the asymptotic properties of momentum from those of Chebyshev polynomials.

The Two Faces of Chebyshev Polynomials

Chebyshev polynomials behave very differently inside and outside the interval $[-1, 1]$. Inside this interval (shaded blue region) the magnitude of these polynomials stays close to zero, while outside it explodes:

The behavior of Chebyshev polynomials of the first and second kind is drastically different inside and outside the $[-1, 1]$ interval.

Open In Colab

Let's make this observation more precise.

Inside the $[-1, 1]$ interval, Chebyshev polynomials admit the trigonometric definitions $T_t(\cos(\theta)) = \cos(t \theta)$ and $U_{t}(\cos(\theta)) = \sin((t+1)\theta) / \sin(\theta)$ and so they have an oscillatory behavior with values bounded in absolute value by 1 and $t+1$ respectively.

Outside of this interval the Chebyshev polynomials of the first kind admit the explicit form for $|\xi| \ge 1$: \begin{align} T_t(\xi) &= \dfrac{1}{2} \Big(\xi-\sqrt{\xi^2-1} \Big)^t + \dfrac{1}{2} \Big(\xi+\sqrt{\xi^2-1} \Big)^t \\ U_t(\xi) &= \frac{(\xi + \sqrt{\xi^2 - 1})^{t+1} - (\xi - \sqrt{\xi^2 - 1})^{t+1}}{2 \sqrt{\xi^2 - 1}}\,. \end{align} Since we're interested in the root factor, we'll be interested in the asymptotics of the $t$-th root of these quantities.With little extra effort, it would be possible to derive non-asymptotic convergence rates, although I won't pursue this analysis here. Luckily, these asymptotics are the same for both polynomialsAlthough we won't use it here, this $t$-th root asymptotic holds for (almost) all orthogonal polynomials, not just Chebyshev polynomials. See for instance reference below and taking limits we have that \begin{equation} \lim_{t \to \infty} \sqrt[t]{|T_t(\xi)|} = \lim_{t \to \infty} \sqrt[t]{|U_t(\xi)|} = |\xi| + \sqrt{\xi^2 - 1}\,. \end{equation}

The Robust Region

Let's start first by considering the case in which the Chebyshev polynomials are only evaluated in the $[-1, 1]$. Since the Chebsyshev polynomials are evaluated at $\sigma(\cdot)$, this implies that $|\sigma(\lambda)| \leq 1$. We'll call the set of step-size and momentum parameters for which the previous inequality is verified the robust region.

Let's visualize this region in a map. Since $\sigma$ is a linear function, its extremal values are reached at the edges: \begin{equation} \max_{\lambda \in [\ell, L]} |\sigma(\lambda)| = \max\{|\sigma(\ell)|, |\sigma(L)|\}\,. \end{equation} Using $\ell \leq L$ and that $\sigma(\lambda)$ is decreasing in $\lambda$, we can simplify the condition that the above is smaller than one to $\sigma(\ell) \leq 1$ and $\sigma(L) \geq -1$, which in terms of the step-size and momentum correspond to: \begin{equation}\label{eq:robust_region} \step \leq \frac{(1 + \sqrt{\mom})^2}{L} \quad \text{ and }\quad \step \geq \frac{(1 - \sqrt{\mom})^2}{\ell}\,. \end{equation} These two conditions provide the upper and lower bound of the robust region.

The robust region –in gray– is upper bounded by $h = \frac{(1 + \sqrt{m})^2}{L}$ and lower bounded $h = \frac{(1 - \sqrt{m})^2 }{\ell}$.

Convergence rate factor

Let $\sigma(\lambda) = \cos(\theta)$ for some $\theta$, which we can do because $\sigma(\lambda) \in [-1, 1]$. In this regime, Chebyshev polynomials verify the identities $T_t(\cos(\theta)) = \cos(t \theta)$ and $U_t(\cos(\theta)) = \sin((t+1)\theta)/\sin(\theta)$ , which replacing in the definition of the residual polynomial gives \begin{equation} P_t(\sigma^{-1}(\cos(\theta))) = \mom^{t/2} \left[ {\small\frac{2\mom}{1+\mom}}\, \cos(t\theta) + {\small\frac{1 - \mom}{1 + \mom}}\,\frac{\sin((t+1)\theta)}{\sin(\theta)}\right]\,. \end{equation}

Since the expression inside the square brackets is bounded in absolute value by $t+2$, taking $t$-th root and then limits we have $\limsup_{t \to \infty} \sqrt[t]{|P_t(\sigma^{-1}(\cos(\theta)))|} = \mom^{t/2}$ for any $\theta$. This gives our first rate factor.

The rate factor in the robust region is $\sqrt{\mom}$.

This is nothing short of magical. It would seem natural –and this will be the case in other regions– that the speed of convergence should depend on both the step-size and the momentum parameter. Yet, this result implies that it's not the case in the robust region. In this region, the convergence only depends on the momentum parameter $\mom$. Amazing.This insensitivity to step-size has been leveraged by Zhang et al. 2017 to develop a momentum tuner

This also illustrates why we call this the robust region. In its interior, perturbing the step-size in a way that we stay within the region has no effect on the convergence rate. The next figure displays the rate factor (darker is faster) in the robust region.

In the robust region, the rate factor only depends on the momentum parameter.

The Lazy Region

Let's consider now what happens outside of the robust region. In this case, the convergence will depend on the largest of $\{|\sigma(\ell)|, |\sigma(L)|\}$. We'll consider first the case in which the maximum is $|\sigma(\ell)|$ and leave the other one for next section.

This region is determined by the inequalities $|\sigma(\ell)| > 1$ and $|\sigma(\ell)| \geq |\sigma(L)|$. Using the definition of $\sigma$ and solving for $h$ gives the equivalent conditions \begin{equation} \step \leq \frac{2 + 2\mom}{L - \ell} \quad \text{ and }\quad \step \leq \frac{(1 - \sqrt{\mom})^2}{\ell}\,. \end{equation} Note the second inequality is the same one as for the robust region \eqref{eq:robust_region} but with the inequality sign reversed. We'll call this the lazy region, as in increasing the momentum will take us out of it and into the robust region.

The lazy region depicted in gray in the left: lower bounded by zero and upper bounded by the minimum of $h = \frac{(1 - \sqrt{m})^2}{\ell}$ and $h = \frac{2 + 2m}{L - \ell}$.

Convergence rate factor

As we saw in Section 2, outside of the $[-1, 1]$ interval both Chebyshev have simple $t$-th root asymptotics. Using this and that both kinds of Chebyshev polynomials agree in sign outside of the $[-1, 1]$ interval we can compute the rate factor as \begin{align} \lim_{t \to \infty} \sqrt[t]{r_t} &= \sqrt{\mom} \lim_{t \to \infty} \sqrt[t]{ {\small\frac{2\mom}{\mom+1}}\, T_t(\sigma(\ell)) + {\small\frac{1 - \mom}{1 + \mom}}\,U_t(\sigma(\ell))} \\ &= \sqrt{\mom}\left(|\sigma(\ell)| + \sqrt{\sigma^2(\ell) - 1} \right)\\ \end{align} This gives the rate factor for this region

In the lazy region the convergence rate factor is $\sqrt{\mom}\left(|\sigma(\ell)| + \sqrt{\sigma^2(\ell) - 1} \right)$.

Unlike in the robust region, this rate depends on both the step-size and the momentum parameter, which enters in the rate through the link function $\sigma$. This can be observed in the color plot of the rate factor

In the lazy region (unlike in the robust region), the rate factor depends on both the step-size and momentum parameter.

The different regimes of Chebyshev polynomials creates a rich dynamics in the convergence of momentum that

Open In Colab

Knife's edge

The robust and lazy region occupy most (but not all!) of the region for which momentum converges. There's a small region that sits between the lazy and robust regions and the region where momentum diverges. We call this region the Knife's edge

For parameters not in the robust or lazy region, we have that $|\sigma(L)| > 1$ and $|\sigma(L)| > |\sigma(\ell)|$. Using the asymptotics of Chebyshev polynomials as we did in the previous section, we have that the rate factor is $\sqrt{\mom}\left(|\sigma(L)| + \sqrt{\sigma^2(L) - 1} \right)$. The method will only converge when this rate factor is below 1, which solving for $\step$ gives the condition $\step \leq 2 (1 + \mom)L$. This plus the condition of not being in the robust or lazy region gives the following characterization for this region: \begin{equation} \step \leq 2 (1 + \mom) L \quad \text{ and } \quad \step \geq \max\Big\{\tfrac{2(1 + \mom)}{L + \ell}, \tfrac{(1 + \sqrt{\mom})^2}{L}\Big\} \end{equation}

Knife's edge is upper bounded by $h = 2(1+m)L$ and lower bounded by $h = \max\Big\{\tfrac{2(1 + \mom)}{L + \ell}, \tfrac{(1 + \sqrt{\mom})^2}{L}\Big\}$.

Convergence rate factor

The rate factor can be computed using the same technique as in the lazy region. The resulting rate is the same as in that region but with $\sigma(L)$ replacing $\sigma(\ell)$:

In the Knife's edge region the convergence rate factor is $\sqrt{\mom}\left(|\sigma(L)| + \sqrt{\sigma^2(L) - 1} \right)$.

Pictorically, this corresponds to

Convergence in Knife's edge. As in the lazy region, the convergence in this region depends on both the step-size and momentum parameter.

Putting it all together

This is the end of our journey. We've visited all the regions on which momentum converges. The only thing left to do is to combine all the rate factors we've gathered along the way.

The convergence rate factor of momentum is \begin{equation} \limsup_{t \to \infty} \sqrt[t]{r_t} = \begin{cases} \sqrt{\mom} &\text{if }\step \in \big[\frac{(1 - \sqrt{\mom})^2}{\ell}, \frac{(1+\sqrt{\mom})^2}{L}\big]\\ \sqrt{\mom}(|\sigma(\ell)| + \sqrt{\sigma(\ell)^2 - 1}) &\text{if } \step \in \big[0, \min\{\frac{2 + 2\mom}{L - \ell}, \frac{(1 - \sqrt{\mom})^2}{\ell}\}\big]\\ \sqrt{\mom}(|\sigma(L)| + \sqrt{\sigma(L)^2 - 1})&\text{if } \step \in \big[\max\big\{\tfrac{2(1 + \mom)}{L + \ell}, \tfrac{(1 + \sqrt{\mom})^2}{L}\big\}, 2 (1 + \mom) L \big]\\ >1 \text{ (divergence)} & \text{ otherwise.} \end{cases} \end{equation}

Plotting the rate factors for all regions we can see that Polyak momentum (the method with momentum $\mom = \left(\frac{\sqrt{L} - \sqrt{\ell}}{\sqrt{L} + \sqrt{\ell}}\right)^2$ and step-size $\step = \left(\frac{2}{\sqrt{L} + \sqrt{\ell}}\right)^2$ which is asymptotically optimal among the momentum methods with constant coefficients) is at the intersection of the three regions.

Open In Colab

Acknowledgements. Thanks to Robert Gower for proofreading this post and providing thoughtful feedback. Thanks also to Courtney Paquette and Adrien Taylor for reporting errors.


References