The Stochastic Gradient Method

The Stochastic Gradient Method (SGM) is one of the most widely-used methods for large-scale optimization and is one of the main methods behind the current AI revolution.

Herbert Robbins was an American mathematician and statistician. Together with Sutton Monro, invented the stochastic gradient method, also known as the Robbins–Monro method.

Stochastic Gradient Method (also known as stochastic gradient descent or SGD) can be used to solve optimization problems in which the objective function is of the form $f(\xx) = \mathbb{E}[f_i(\xx)]$, where the expectation is taken with respect to $i$. The most comnmon case is when $i$ can take a finite number of values, in which the problem becomes

\begin{equation}\label{eq:fw_objective} \minimize_{\boldsymbol{x} \in \RR^p} \frac{1}{n} \sum_{i=1}^n f_i(\boldsymbol{x}) ~, \end{equation}

This class of problems arises often in machine learning, where the $f_i$ is the cost of mis-classifying each element in the dataset.

The Stochastic Gradient Method can motivated as an approximation to gradient descent in which at each iteration we approximate the gradient as $\nabla f_i(\xx_t) \approx \nabla f(\xx_t)$:

The Stochastic Gradient Method

We can write the full gradient descent algorithm as follows. The algorithm only has one free parameter: the step size $\gamma$.

\begin{align} &\textbf{Input}: \text{initial guess $\xx_0$, step size sequence $\gamma_t > 0$}\nonumber\\ & \textbf{For }t=0, 1, \ldots \textbf{ do } \\ &\quad\text{Choose $i \in \{1, 2, \ldots, n \}$ uniformly at random} \\ &\quad\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \gamma_t \nabla f_i(\xx_t)~.\label{eq:update_rule}\\ &\textbf{end For loop}\\ & \textbf{return } \xx_t \end{align}

SGM can be much more efficient than gradient descent in the case in which the objective consists of a large sum because at each iteration we only need to evaluate a partial gradient and not the full gradient.

Step size

The choice of step size is one of the most delicate aspects of SGM. In this case, backtracking line search is not an option since it would involve to evaluate the objective function at each iteration, destroying the computational advantage of this method.

Two popular step size strategies exists for SGM, constant step size and decreasing step size

Examples

A least squares problem can be written in the form acceptable by SGD since \begin{equation} \frac{1}{n}\|\boldsymbol{A} \xx - \boldsymbol{b}\|^2 = \frac{1}{n} \sum_{i=1}^n (\boldsymbol{A}_i^T \xx - \boldsymbol{b}_i)^2 \end{equation} where $\boldsymbol{A}_i$ is the $i$-th row of the matrix $\boldsymbol{A}$ .

Stochastic Gradient with constant step-size

Step-size α = 0.2
Stochastic Gradient with constant step size converges quickly to a neighborhood of the optimum, but then bounces around.
Step-size α = 0.2
Step-size α = 0.02
Stochastic Gradient with decreasing step sizes is quite robust to the choice of step size. On one hand there is really no good way to set the step size (e.g., no equivalent of line search for Gradient Descent) but on the other hand it converges for a wide range of step sizes.

Convergence

Why does the SGM method converge despite its update being a very rough estimate of the gradient?. To answer this, we must first understand the unbiasedness property of its update.

Unbiasedness of the SGM update. Let $\mathbb{E}_t$ denote the expectation with respect to the choice of random sample ($i$) at iteration $t$. Then since the index $i$ is chosen uniformly at random we have $$ \mathbb{E}_t \nabla f_{i_t}(\xx_t) = \sum_{i=1}^n \nabla f_{i}(\xx_t) P(i_t = i) = \frac{1}{n}\sum_{i=1}^n \nabla f_{i}(\xx_t) = \nabla f(\xx_t)$$ This is the crucial property that makes SGD work. For a full proof, see e.g.