
Optimization Nuggets: Stochastic Polyak Step-size, Part 2

Faster rates under strong convexity

This blog post discusses the convergence rate of the Stochastic Gradient Descent with Stochastic Polyak Step-size (SGD-SPS) algorithm for minimizing a finite sum objective. Building upon the proof of the previous post, we show that the convergence rate can be improved to O(1/t) under the additional assumption that the objective function is strongly convex.

$$ \require{mathtools} \require{color} \def\aa{\boldsymbol a} \def\rr{\boldsymbol r} \def\AA{\boldsymbol A} \def\HH{\boldsymbol H} \def\EE{\mathbb E} \def\II{\boldsymbol I} \def\CC{\boldsymbol C} \def\DD{\boldsymbol D} \def\KK{\boldsymbol K} \def\eeps{\boldsymbol \varepsilon} \def\tr{\text{tr}} \def\LLambda{\boldsymbol \Lambda} \def\bb{\boldsymbol b} \def\cc{\boldsymbol c} \def\xx{\boldsymbol x} \def\zz{\boldsymbol z} \def\uu{\boldsymbol u} \def\vv{\boldsymbol v} \def\qq{\boldsymbol q} \def\yy{\boldsymbol y} \def\ss{\boldsymbol s} \def\pp{\boldsymbol p} \def\lmax{L} \def\lmin{\ell} \def\RR{\mathbb{R}} \def\TT{\boldsymbol T} \def\QQ{\boldsymbol Q} \def\CC{\boldsymbol C} \def\Econd{\boldsymbol E} \DeclareMathOperator*{\argmin}{{arg\,min}} \DeclareMathOperator*{\argmax}{{arg\,max}} \DeclareMathOperator*{\minimize}{{minimize}} \DeclareMathOperator*{\dom}{\mathbf{dom}} \DeclareMathOperator*{\Fix}{\mathbf{Fix}} \DeclareMathOperator{\prox}{\mathbf{prox}} \DeclareMathOperator{\span}{\mathbf{span}} \def\defas{\stackrel{\text{def}}{=}} \def\dif{\mathop{}\!\mathrm{d}} \definecolor{colormomentum}{RGB}{27, 158, 119} \def\cvx{{\color{colormomentum}\mu}} \definecolor{color1}{RGB}{127,201,127} \definecolor{color2}{RGB}{179,226,205} \definecolor{color3}{RGB}{253,205,172} \definecolor{color4}{RGB}{203,213,232} \definecolor{colorstepsize}{RGB}{217,95,2} \def\stepsize{{\color{color3}{\boldsymbol{\gamma}}}} \def\harmonic{{\color{colorstep2}\boldsymbol{h}}} \def\cvx{{\color{colorcvx}\boldsymbol{\mu}}} \def\smooth{{\color{colorsmooth}\boldsymbol{L}}} \def\noise{{\color{colornoise}\boldsymbol{\sigma}}} $$

Faster Convergence under Strong Convexity

Shortly after I published my last post on the convergence of the stochastic Polyak step-size, my namesake Fabian Schaipp pointed out that the convergence rate can be improved to $\mathcal{O}(1/t)$ under strong convexity of the objective.

small update: using the same techniques + the proof technique from @HazanPrinceton's paper, you can also show 1/t convergence for strongly convex, nonsmooth. You need to assume bounded gradients *only for a bounded set*.

— Fabian Schaipp (@FSchaipp) October 13, 2023

This faster rate is obtained with the same technique that Elad Hazan and Sham Kakade used to analyze the deterministic Polyak step-size. The proof is short and elegant, so I decided to write another blog post about it with Fabian Schaipp.

Stochastic Gradient Descent with Stochastic Polyak Step-size

We'll consider the same setting as in the previous post. We aim to minimize a finite sum objective $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, where each $f_i$ is star-convex and $f$ is $\mu$-strongly convex. The SGD with stochastic Polyak step-size (SGD-SPS) algorithm (the variant described here was proposed by Garrigos et al. (2023) under the name $\text{SPS}_+$) is defined by the following recursion: \begin{equation}\label{eq:sps} \begin{aligned} & \text{sample uniformly at random } i \in \{1, \ldots, n\}\\ & \stepsize_t = \frac{1}{\|\nabla f_i(x_t)\|^2}(f_i(x_t) - f_i(x_\star))_+ \\ & x_{t+1} = x_t - \stepsize_t \nabla f_i(x_t) \end{aligned} \end{equation} where $x_\star = \arg \min f(x)$ is the solution to the problem (in this case the solution is unique because of strong convexity).
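
To make the recursion concrete, here is a minimal NumPy sketch of the SGD-SPS update (our own illustration; the helper names `sps_step` and `sgd_sps` are not from any library). It assumes the optimal values $f_i(x_\star)$ are known, as the algorithm requires; under interpolation they are zero, which is the default below.

    import numpy as np


    def sps_step(x, f_i_x, grad_f_i_x, f_i_star=0.0, eps=1e-12):
        """One SGD-SPS update for the sampled function f_i.

        x           -- current iterate x_t
        f_i_x       -- f_i(x_t), the sampled loss value
        grad_f_i_x  -- gradient (or a subgradient) of f_i at x_t
        f_i_star    -- f_i(x_star), assumed known (zero under interpolation)
        eps         -- small constant guarding against a vanishing gradient
        """
        # Stochastic Polyak step-size: positive part of the function gap,
        # divided by the squared gradient norm.
        gamma = max(f_i_x - f_i_star, 0.0) / (np.dot(grad_f_i_x, grad_f_i_x) + eps)
        return x - gamma * grad_f_i_x


    def sgd_sps(fs, grads, x0, n_steps, seed=0):
        """Run SGD-SPS on the finite sum f = (1/n) * sum_i f_i."""
        rng = np.random.default_rng(seed)
        x, n = np.asarray(x0, dtype=float).copy(), len(fs)
        for _ in range(n_steps):
            i = rng.integers(n)  # sample i uniformly at random
            x = sps_step(x, fs[i](x), grads[i](x))
        return x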

$\mathcal{O}(1/t)$ rate under strong convexity

While the previous post showed an $\mathcal{O}(1/\sqrt{t})$ convergence rate under star-convexity, this post shows that the convergence rate can be improved to $\mathcal{O}(1/t)$ under the additional assumption that $f$ (but not necessarily the $f_i$'s) is strongly convex.

Without further ado, here is the main result of this post.

Assume that $f$ is $\mu$-strongly convex and that each $f_i$ is star-convex around the minimizer $x_\star$. Furthermore, assume that the subgradients are bounded on the ball $\mathcal{B}$ with center $x_\star$ and radius $\|x_0 - x_\star\|$, that is, $\|\nabla f_i(x)\|\leq G$ for all $i$ and all $x \in \mathcal{B}$. Then SGD-SPS converges in expected squared distance to the optimum at a rate of $\mathcal{O}(1/(T+1))$. That is, after $T$ steps we have \begin{equation} \EE \|x_T - x_\star\|^2 \leq \frac{4 G^2}{\mu^2 (T+1)} \,. \end{equation}
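
Before going through the proof, here is a quick numerical sanity check of this rate (our illustration, not part of the original analysis). It reuses the `sps_step` sketch above on a synthetic interpolating least-squares problem, for which $f$ is strongly convex and $f_i(x_\star) = 0$ for every $i$; the problem size and random seed are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 20

    # Interpolating least squares: f_i(x) = 0.5 * (a_i @ x - b_i)**2 with b = A @ x_star,
    # so f_i(x_star) = 0 for every i and f is strongly convex (A has full column rank).
    A = rng.standard_normal((n, d))
    x_star = rng.standard_normal(d)
    b = A @ x_star

    fs = [lambda x, a=A[i], bi=b[i]: 0.5 * (a @ x - bi) ** 2 for i in range(n)]
    grads = [lambda x, a=A[i], bi=b[i]: (a @ x - bi) * a for i in range(n)]

    for T in (10, 100, 1_000, 10_000):
        x = np.zeros(d)
        for _ in range(T):
            i = rng.integers(n)
            x = sps_step(x, fs[i](x), grads[i](x))  # sps_step from the sketch above
        # The theorem bounds this by 4 G^2 / (mu^2 (T + 1)); it may decay faster in practice.
        print(T, np.sum((x - x_star) ** 2))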

From the main recursive inequality proven in Eq. (9) of the previous post, we have that \begin{equation}\label{eq:key_inequality} \EE(f(x_t) - f(x_\star))^2 \leq G^2(\EE\|x_t - x_\star\|^2 - \EE\|x_{t+1} - x_\star\|^2) \end{equation} Strong convexity of $f$ implies that $f(x_t) - f(x_\star) \geq \frac{\mu}{2}\|x_t - x_\star\|^2$. Plugging this into the previous inequality, using Jensen's inequality $\EE\|x_t - x_\star\|^4 \geq \left(\EE\|x_t - x_\star\|^2\right)^2$ and grouping terms, we have \begin{equation} \EE\|x_{t+1} - x_\star\|^2\leq \EE\|x_t - x_\star\|^2\left(1 - \frac{\mu^2}{4 G^2}\EE\|x_t - x_\star\|^2\right) \,. \end{equation} Let $a_t \defas \frac{\mu^2}{4 G^2}\EE\|x_t - x_\star\|^2$. Multiplying both sides of the previous inequality by $\frac{\mu^2}{4 G^2}$ we have \begin{equation} a_{t+1} \leq a_t (1 - a_t) \,. \end{equation}

We'll now prove by induction that the inequality above implies $a_t \leq \frac{1}{t+1}$ for all $t$. For the base case $t=0$ we have by strong convexity and the boundedness of the gradients that \begin{equation} a_0 = \frac{\mu^2}{4 G^2}\EE\|x_0 - x_\star\|^2 \leq \frac{1}{4 G^2}\|\nabla f(x_0)\|^2 \leq \frac{1}{4} \,. \end{equation} For $t=1$, we have $a_1 \leq a_0(1 - a_0)$ with $a_0 \in [0, 1]$. As $x \mapsto x (1 - x)$ has a maximal value $\frac{1}{4}$ over the interval $[0, 1]$, we get $a_1 \leq \frac{1}{4} \leq \frac{1}{2}$, as required. For the induction step, assume that $a_{t-1} \leq \frac{1}{t}$ for some $t \geq 2$. Since $x \mapsto x (1 - x)$ is increasing on $[0, \frac{1}{2}]$ and $\frac{1}{t} \leq \frac{1}{2}$, we have \begin{align} a_t &\leq a_{t-1}(1 - a_{t-1}) \leq \max_{x \in [0, 1/t]} x (1 - x) \\ &= \frac{1}{t}\left(1 - \frac{1}{t}\right) = \frac{1}{t+1} \cdot \frac{t^2 - 1}{t^2} \leq \frac{1}{t+1} \,. \end{align}

We have hence proven that $a_t \leq \frac{1}{t+1}$ for all $t$. Plugging this back into the definition of $a_t$ and multiplying both sides by $\frac{4 G^2}{\mu^2}$ yields the desired result.
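
As a small complement to the induction argument (again purely illustrative), one can check numerically that iterating the worst case $a_{t+1} = a_t(1 - a_t)$ starting from $a_0 = 1/4$ indeed stays below $1/(t+1)$:

    a = 0.25  # a_0 <= 1/4, as shown in the base case
    ok = True
    for t in range(1, 100_000):
        a = a * (1.0 - a)                 # worst case of a_t <= a_{t-1} * (1 - a_{t-1})
        ok = ok and (a <= 1.0 / (t + 1))  # induction claim: a_t <= 1 / (t + 1)
    print(ok)  # expected: True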

It's important to note that in the previous result, the bounded subgradients assumption is restricted to the bounded set $\mathcal{B}$. We can make this weaker assumption, instead of the more common assumption of bounded subgradients on the whole domain, thanks to the error monotonicity proven in Eq. (5) of the previous post. Without that result, we would have to assume bounded subgradients on the whole space, which is incompatible with strong convexity (in other words, the set of functions with bounded subgradients on the whole space that are also strongly convex is empty).

Citing

If you find this blog post useful, please consider citing as

Stochastic Polyak Step-size, Faster rates under strong convexity, Fabian Pedregosa and Fabian Schaipp, 2023

with bibtex entry:

  
    @misc{pedregosa2023sps2,
      title = {Stochastic Polyak Step-size, Faster rates under strong convexity},
      author = {Pedregosa, Fabian and Schaipp, Fabian},
      howpublished = {\url{http://fa.bianp.net/blog/2023/sps2/}},
      year = {2023}
    }
  
  

References