Optimization Nuggets: Exponential Convergence of SGD
This is the first of a series of blog posts on short and beautiful proofs in optimization (let me know what you think in the comments!). In this first post I'll show that stochastic gradient descent (SGD) with a constant step size converges exponentially fast to a neighborhood of the solution.
The problem is to find a minimizer $x_{\star}$ of a function $f$, which can be expressed as the expectation of another function $F$: \begin{equation} x_{\star} \in \argmin_{x \in \RR^p} \left\{ f(x) \defas \EE_{\xi}F(x; \xi) = \int_{\mathcal{S}} F(x; \xi)\dif P(\xi)\right\}\,, \end{equation} with access to $\nabla F(x; \xi)$, which is the gradient of $F$ with respect to the first argument. Here, $\mathcal{S}$ denotes the sample space, and $\xi \sim P$ is an $\mathcal{S}$-valued random variable.
SGD is one of the most popular methods to solve this problem. At each iteration, it samples $\xi_t \sim P$ and generates the next iterate $x_{t+1}$ according to the recursion \begin{equation} x_{t+1} = x_t - \gamma_t \nabla F(x_t; \xi_t) \,, \end{equation} where $\gamma_t > 0$ is a step-size parameter. While the classical setup for SGD takes the step size $\gamma_t$ to be decreasing in the iteration counter $t$, here we'll instead consider a constant step size $\gamma_t = \gamma$.
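To make the recursion concrete, here is a minimal Python sketch of constant step-size SGD. The callables `grad_F` (standing in for $\nabla F(\cdot\,; \xi)$) and `sample_xi` (drawing $\xi \sim P$) are hypothetical placeholders for whatever problem is at hand.

```python
import numpy as np

def sgd_constant_step(x0, grad_F, sample_xi, step_size, n_iter):
    """Constant step-size SGD: x_{t+1} = x_t - gamma * grad F(x_t; xi_t).

    x0        : initial iterate (array-like).
    grad_F    : callable (x, xi) -> gradient of F(x; xi) with respect to x.
    sample_xi : callable () -> a fresh sample xi ~ P.
    step_size : the constant step size gamma > 0.
    n_iter    : number of iterations to run.
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        xi = sample_xi()                      # draw xi_t ~ P
        x = x - step_size * grad_F(x, xi)     # gradient step on the sampled loss
    return x
```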
My goal in this post is to prove that SGD converges exponentially fast under standard assumptions.
Assumptions. We make three assumptions, one for each of the three constants that appear in the rate: smoothness, curvature and noise.
The first two assumptions are the classical $\smooth$-smoothness and $\cvx$-strong convexity properties, pervasive in optimization. They require that, for all $x, y \in \RR^p$ and every $\xi$, the inner product $\langle\nabla F(x; \xi) - \nabla F(y; \xi), x - y\rangle$ is bounded below and above by quadratics: \begin{equation} \cvx \|x - y\|^2 \leq \langle \nabla F(x; \xi) - \nabla F(y; \xi), x - y\rangle \leq \smooth \|x - y\|^2\,. \end{equation}
These two inequalities can be combined into the following one, which is the form we will use in the proof: \begin{equation}\label{eq:smooth_cvx} \langle \nabla F(x; \xi) - \nabla F(y; \xi), x - y\rangle \geq \frac{\cvx \smooth}{\cvx + \smooth} \|x - y\|^2 + \frac{1}{\cvx + \smooth}\|\nabla F(x; \xi) - \nabla F(y; \xi)\|^2\,. \end{equation}
The third assumption is that the variance of the gradient at optimum is bounded. That is, there exists a finite scalar $\noise$ such that \begin{equation} \EE_{\xi} \|\nabla F(x_{\star}; \xi)\|^2 \leq \noise^2\,. \end{equation}
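To see what these constants look like on a concrete instance (an illustrative example of mine, not from the original post), consider $\ell_2$-regularized least-squares losses $F(x; i) = \tfrac{1}{2}(\langle a_i, x\rangle - b_i)^2 + \tfrac{\lambda}{2}\|x\|^2$ averaged over a finite dataset: each loss is $\lambda$-strongly convex and $(\|a_i\|^2 + \lambda)$-smooth, and $\noise^2$ can be estimated at the minimizer. The sketch below does this for a random instance; all problem sizes and the regularization strength are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 10, 0.1                      # samples, dimension, ridge parameter (arbitrary choices)
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

# Per-sample loss: F(x; i) = 0.5 * (a_i @ x - b_i)**2 + 0.5 * lam * ||x||^2
def grad_F(x, i):
    return (A[i] @ x - b[i]) * A[i] + lam * x

mu = lam                                      # each F(.; i) is lam-strongly convex
L = np.max(np.sum(A**2, axis=1)) + lam        # largest per-sample smoothness constant

# Minimizer of f(x) = mean_i F(x; i): the ridge-regression solution
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(p), A.T @ b / n)

# Noise at the optimum: sigma^2 = E ||grad F(x_star; xi)||^2
sigma2 = np.mean([np.sum(grad_F(x_star, i)**2) for i in range(n)])
print(mu, L, sigma2)
```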
With these ingredients, we can show that the expected error of SGD is bounded by the sum of two terms: one that decreases exponentially fast in the iteration number, and one that is independent of the iteration counter.
Theorem. Let $\stepsize \leq \frac{1}{\cvx + \smooth}$ and let $\harmonic \defas \frac{2 \cvx \smooth}{\cvx + \smooth}$ denote the harmonic mean of $\cvx$ and $\smooth$. Then the expected iterate suboptimality converges exponentially fast to a $2\frac{\stepsize}{\harmonic}\noise^2$-radius ball centered at the solution. More precisely, at iteration $t$ we have the following bound: \begin{align} \EE\, \|x_{t} - x_\star\|^2 \leq \underbrace{\vphantom{\sum_i}\left(1 - \stepsize \harmonic \right)^{t}\,\|x_0 - x_\star\|^2}_{\text{exponential decrease}} + \underbrace{\vphantom{\sum_i}2\frac{\stepsize}{\harmonic}\noise^2}_{\text{stationary}}\,. \end{align}
Proof. Using the definition of $x_{t+1}$ and expanding the square we have
\begin{align}
\|x_{t+1} - x_\star\|^2 &= \|x_t - x_\star\|^2 - 2 \stepsize \langle \nabla F(x_t; \xi_t) - \nabla F(x_\star; \xi_t) + \nabla F(x_\star; \xi_t), x_t - x_\star\rangle \nonumber\\
&\qquad + \stepsize^2 \|\nabla F(x_t; \xi_t) -\nabla F(x_\star; \xi_t) + \nabla F(x_\star; \xi_t)\|^2 \\
&\leq \left( 1 - \stepsize \harmonic \right)\|x_t - x_\star\|^2 + 2\stepsize\left[\stepsize - \tfrac{1}{\cvx + \smooth}\right]\|\nabla F(x_t; \xi_t) - \nabla F(x_\star; \xi_t)\|^2 \nonumber\\
& \qquad - 2 \stepsize \langle \nabla F(x_\star; \xi_t), x_t - x_\star\rangle + 2 \stepsize^2 \| \nabla F(x_\star; \xi_t)\|^2\,,
\end{align}
where in the equality we have added and subtracted $\nabla F(x_\star; \xi_t)$, and in the inequality we have used Eq. \eqref{eq:smooth_cvx} on the first inner product and the bound $\|a + b\|^2 \leq 2 \|a\|^2 + 2 \|b\|^2$ on the last term.
Taking the expectation conditioned on $x_t$, the last term can be bounded by $2 \stepsize^2 \noise^2$, while the term before it vanishes, since $\EE_{\xi_t}\nabla F(x_\star; \xi_t) = \nabla f(x_\star) = 0$ by optimality of $x_\star$. Furthermore, the condition $\stepsize \leq \frac{1}{\cvx + \smooth}$ ensures that the quantity inside the square brackets is non-positive, so we can drop that term. All in all, we have \begin{equation} \EE_{\xi_t | x_t} \|x_{t+1} - x_\star\|^2 \leq \left( 1 - \stepsize \harmonic \right)\|x_t - x_\star\|^2 + 2 \stepsize^2 \noise^2\,. \end{equation}
We now take full expectations and unroll the previous inequality down to $t=0$, which allows us to bound the expected suboptimality by the sum of an exponentially decreasing term and a geometric series \begin{align} \EE \|x_{t+1} - x_\star\|^2 \leq \left( 1 - \stepsize \harmonic \right)^{t+1}\|x_0 - x_\star\|^2 + 2 \stepsize^2 \noise^2 \sum_{i=0}^t \left( 1 - \stepsize \harmonic \right)^i\,. \end{align} The condition $\stepsize \leq \frac{1}{\cvx + \smooth} \leq \frac{1}{\harmonic}$ also ensures that $0 \leq 1 - \stepsize \harmonic \leq 1$, so the geometric series in the last term is monotonically increasing in $t$ and has a finite limit. The last step is to bound this geometric series by its limit: \begin{equation} \sum_{i=0}^t \left( 1 - \stepsize \harmonic \right)^i \leq \sum_{i=0}^\infty \left( 1 - \stepsize \harmonic \right)^i = \frac{1}{\stepsize \harmonic}\,. \end{equation} Plugging this last inequality into the previous one gives the desired bound.
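To sanity-check the theorem numerically, one can run constant step-size SGD on the regularized least-squares instance above and compare a Monte Carlo estimate of $\EE\|x_t - x_\star\|^2$ with the right-hand side of the bound. The sketch below reuses `grad_F`, `x_star`, `mu`, `L`, `sigma2`, `rng`, `n` and `p` from the previous snippet, so it assumes that code has already been run.

```python
harmonic = 2 * mu * L / (mu + L)              # harmonic mean of mu and L
gamma = 1.0 / (mu + L)                        # step size satisfying the theorem's condition

n_runs, n_iter = 50, 500
x0 = np.ones(p)
errors = np.zeros(n_iter + 1)
for _ in range(n_runs):
    x = x0.copy()
    errors[0] += np.sum((x - x_star)**2)
    for t in range(n_iter):
        i = rng.integers(n)                   # sample xi_t uniformly from the dataset
        x = x - gamma * grad_F(x, i)
        errors[t + 1] += np.sum((x - x_star)**2)
errors /= n_runs                              # Monte Carlo estimate of E ||x_t - x_star||^2

# Right-hand side of the theorem at each iteration t
bound = ((1 - gamma * harmonic) ** np.arange(n_iter + 1) * np.sum((x0 - x_star)**2)
         + 2 * gamma / harmonic * sigma2)
print(np.all(errors <= bound))                # the estimate should typically sit below the bound
```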
Implications for interpolating models.
For models in which $\nabla F(x_{\star}; \xi) = 0$ for every $\xi$, the noise parameter $\noise$ is zero and so SGD converges exponentially fast to the global optimum. The condition $\nabla F(x_{\star}; \xi) = 0$ is often referred to as the interpolation condition (intuitively, it states that all partial losses $F(\cdot\,; \xi)$ admit the same minimizer) and has recently been leveraged to design backtracking and adaptive step-size schedulers.
In this case, the stationary term in the above theorem vanishes, and the bound implies global exponential convergence. As far as I know, this was first noted by Schmidt and Le Roux.
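As a toy illustration of the interpolation regime (again a sketch of mine, not an example taken from the references): in noiseless linear regression with targets $b = A x_{\text{true}}$, every per-sample gradient vanishes at $x_{\text{true}}$, so $\noise = 0$ and constant step-size SGD exhibits no stationary noise floor. Note that the unregularized per-sample losses below are only convex rather than strongly convex, so this illustrates the interpolation condition itself rather than a strict instance of the theorem's assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 10
A = rng.standard_normal((n, p))
x_true = rng.standard_normal(p)
b = A @ x_true                                # noiseless targets: x_true fits every sample exactly

# Per-sample gradient of F(x; i) = 0.5 * (a_i @ x - b_i)**2
def grad_F(x, i):
    return (A[i] @ x - b[i]) * A[i]

# Interpolation: all per-sample gradients vanish at the shared minimizer x_true
print(max(np.linalg.norm(grad_F(x_true, i)) for i in range(n)))   # ~0 up to floating point

# Constant step-size SGD converges to x_true without a stationary noise floor
gamma = 1.0 / np.max(np.sum(A**2, axis=1))    # 1 / (largest per-sample smoothness constant)
x = np.zeros(p)
for _ in range(5000):
    i = rng.integers(n)
    x = x - gamma * grad_F(x, i)
print(np.linalg.norm(x - x_true))             # should be close to 0
```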
Acknowledgements. Thanks to Konstantin Mishchenko for spotting numerous typos and for pointing out that I was wrongly using the term "over-parametrized models" (now corrected), to Baptiste Goujaud for reporting typos and making very useful suggestions, and to Robert Gower and Bart van Merriënboer for feedback on the post.