The Stochastic Gradient Method (SGM)

Stochastic Gradient Method (also known as stochastic gradient descent or SGD) can be used to solve optimization problems in which the objective function is of the form $f(\xx) = \mathbb{E}[f_i(\xx)]$, where the expectation is taken with respect to $i$. The most comnmon case is when $i$ can take a finite number of values, in which the problem becomes

\begin{equation}\label{eq:fw_objective} \minimize_{\boldsymbol{x} \in \RR^p} \frac{1}{n} \sum_{i=1}^n f_i(\boldsymbol{x}) ~, \end{equation}

This class of problems arises often in machine learning, where the $f_i$ is the cost of mis-classifying each element in the dataset.

The Stochastic Gradient Method can motivated as an approximation to gradient descent in which at each iteration we approximate the gradient as $\nabla f_i(\xx_t) \approx \nabla f(\xx_t)$:

We can write the full gradient descent algorithm as follows. The algorithm only has one free parameter: the step size $\gamma$.

\begin{align} &\textbf{Input}: \text{initial guess $\xx_0$, step size sequence $\gamma_t > 0$}\nonumber\\ & \textbf{For }t=0, 1, \ldots \textbf{ do } \\ &\quad\text{Choose $i \in \{1, 2, \ldots, n \}$ uniformly at random} \\ &\quad\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \gamma_t \nabla f_i(\xx_t)~.\label{eq:update_rule}\\ &\textbf{end For loop}\\ & \textbf{return } \xx_t \end{align}

SGM can be much more efficient than gradient descent in the case in which the objective consists of a large sum because at each iteration we only need to evaluate a partial gradient and not the full gradient.

The choice of step size is one of the most delicate aspects of SGM. In this case, backtracking line search is not an option since it would involve to evaluate the objective function at each iteration, destroying the computational advantage of this method.

Two popular step size strategies exists for SGM, constant step size and decreasing step size

- In the constant step size strategy, $\gamma_t=\gamma$ for some pre-determined constant $\gamma$. The method converges very fast to neighborbood of a local minima and the bounces around. The radius of this neighborhood will depend on the step size
. - One can guarantee convergence to a local minimizer choosing a step size sequence that verifies $$ \sum_{i=1}^\infty \gamma_t = \infty \text{ and } \sum_{i=1}^\infty \gamma_t^2 < \infty $$ The most popular sequence to verify this is $\gamma_t = \frac{C}{t}$ for some constant $C$. This is often referred to as a "decreasing step-size" sequence, although in fact the sequence does not need to be monotonically decreasing.

A least squares problem can be written in the form acceptable by SGD since \begin{equation} \frac{1}{n}\|\boldsymbol{A} \xx - \boldsymbol{b}\|^2 = \frac{1}{n} \sum_{i=1}^n (\boldsymbol{A}_i^T \xx - \boldsymbol{b}_i)^2 \end{equation} where $\boldsymbol{A}_i$ is the $i$-th row of the matrix $\boldsymbol{A}$ .

Stochastic Gradient with constant step-size

Why does the SGM method converge despite its update being a very rough estimate of the gradient?. To answer this, we must first understand the *unbiasedness* property of its update.

**Unbiasedness of the SGM update**. Let $\mathbb{E}_t$ denote the expectation with respect to the choice of random sample ($i$) at iteration $t$. Then since the index $i$ is chosen *uniformly* at random we have
$$
\mathbb{E}_t \nabla f_{i_t}(\xx_t) = \sum_{i=1}^n \nabla f_{i}(\xx_t) P(i_t = i)
= \frac{1}{n}\sum_{i=1}^n \nabla f_{i}(\xx_t) = \nabla f(\xx_t)$$
This is the crucial property that makes SGD work. For a full proof, see e.g.