# The Stochastic Gradient Method

The Stochastic Gradient Method (SGM) is one of the most widely-used methods for large-scale optimization and is one of the main methods behind the current AI revolution. Herbert Robbins was an American mathematician and statistician. Together with Sutton Monro, invented the stochastic gradient method, also known as the Robbins–Monro method.

Stochastic Gradient Method (also known as stochastic gradient descent or SGD) can be used to solve optimization problems in which the objective function is of the form $f(\xx) = \mathbb{E}[f_i(\xx)]$, where the expectation is taken with respect to $i$. The most comnmon case is when $i$ can take a finite number of values, in which the problem becomes

\begin{equation}\label{eq:fw_objective} \minimize_{\boldsymbol{x} \in \RR^p} \frac{1}{n} \sum_{i=1}^n f_i(\boldsymbol{x}) ~, \end{equation}

This class of problems arises often in machine learning, where the $f_i$ is the cost of mis-classifying each element in the dataset.

The Stochastic Gradient Method can motivated as an approximation to gradient descent in which at each iteration we approximate the gradient as $\nabla f_i(\xx_t) \approx \nabla f(\xx_t)$:

## The Stochastic Gradient Method

We can write the full gradient descent algorithm as follows. The algorithm only has one free parameter: the step size $\gamma$.

\begin{align} &\textbf{Input}: \text{initial guess $\xx_0$, step size sequence $\gamma_t > 0$}\nonumber\\ & \textbf{For }t=0, 1, \ldots \textbf{ do } \\ &\quad\text{Choose $i \in \{1, 2, \ldots, n \}$ uniformly at random} \\ &\quad\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \gamma_t \nabla f_i(\xx_t)~.\label{eq:update_rule}\\ &\textbf{end For loop}\\ & \textbf{return } \xx_t \end{align}

SGM can be much more efficient than gradient descent in the case in which the objective consists of a large sum because at each iteration we only need to evaluate a partial gradient and not the full gradient.

## Step size

The choice of step size is one of the most delicate aspects of SGM. In this case, backtracking line search is not an option since it would involve to evaluate the objective function at each iteration, destroying the computational advantage of this method.

Two popular step size strategies exists for SGM, constant step size and decreasing step size

• In the constant step size strategy, $\gamma_t=\gamma$ for some pre-determined constant $\gamma$. The method converges very fast to neighborbood of a local minima and the bounces around. The radius of this neighborhood will depend on the step size.
• One can guarantee convergence to a local minimizer choosing a step size sequence that verifies $$\sum_{i=1}^\infty \gamma_t = \infty \text{ and } \sum_{i=1}^\infty \gamma_t^2 < \infty$$ The most popular sequence to verify this is $\gamma_t = \frac{C}{t}$ for some constant $C$. This is often referred to as a "decreasing step-size" sequence, although in fact the sequence does not need to be monotonically decreasing.

## Examples

A least squares problem can be written in the form acceptable by SGD since \begin{equation} \frac{1}{n}\|\boldsymbol{A} \xx - \boldsymbol{b}\|^2 = \frac{1}{n} \sum_{i=1}^n (\boldsymbol{A}_i^T \xx - \boldsymbol{b}_i)^2 \end{equation} where $\boldsymbol{A}_i$ is the $i$-th row of the matrix $\boldsymbol{A}$ .

Stochastic Gradient with constant step-size

## SGD with momentum

SGD is often augmented with a momentum term, so that the update in \eqref{eq:update_rule} becomes $$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \gamma_t \nabla f_i(\xx_t) + \underbrace{\beta (\xx_t - \xx_{t-1})}_{\text{momentum}}$$

Momentum adds the previous update direction to the gradient. This has two beneficial effects:

• Stabilizing the update in case of outlayer gradients.
• Escaping regions of flat curvature.