The Stochastic Gradient Method (SGM)
The Stochastic Gradient Method (also known as stochastic gradient descent, or SGD) can be used to solve optimization problems in which the objective function is of the form $f(\xx) = \mathbb{E}[f_i(\xx)]$, where the expectation is taken with respect to the random index $i$. The most common case is when $i$ can take a finite number of values, in which case the problem becomes
\begin{equation}\label{eq:fw_objective} \minimize_{\boldsymbol{x} \in \RR^p} \frac{1}{n} \sum_{i=1}^n f_i(\boldsymbol{x}) ~. \end{equation}
This class of problems arises often in machine learning, where $f_i$ is the cost of misclassifying the $i$-th element in the dataset.
The Stochastic Gradient Method can be motivated as an approximation to gradient descent in which, at each iteration, we replace the full gradient by a partial gradient, $\nabla f(\xx_t) \approx \nabla f_i(\xx_t)$.
The full algorithm can be written as follows. It has a single free parameter: the step size sequence $\gamma_t$.
\begin{align} &\textbf{Input}: \text{initial guess $\xx_0$, step size sequence $\gamma_t > 0$}\nonumber\\ & \textbf{For }t=0, 1, \ldots \textbf{ do } \\ &\quad\text{Choose $i \in \{1, 2, \ldots, n \}$ uniformly at random} \\ &\quad\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \gamma_t \nabla f_i(\xx_t)~.\label{eq:update_rule}\\ &\textbf{end For loop}\\ & \textbf{return } \xx_t \end{align}
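As a rough illustration, here is a minimal Python sketch of this loop. The names (`sgm`, `grad_i`, `step_size`) and the NumPy-based setup are illustrative choices, not part of any particular library:

```python
import numpy as np

def sgm(grad_i, x0, step_size, n_samples, max_iter=1000, rng=None):
    """Minimal SGM loop: grad_i(x, i) returns the partial gradient of f_i at x,
    and step_size(t) returns the step size gamma_t at iteration t."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for t in range(max_iter):
        i = rng.integers(n_samples)           # choose i uniformly at random
        x = x - step_size(t) * grad_i(x, i)   # update rule from the algorithm above
    return x
```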
SGM can be much more efficient than gradient descent when the objective consists of a large sum, because at each iteration we only need to evaluate a single partial gradient $\nabla f_i$ and not the full gradient, whose cost grows linearly with $n$.
The choice of step size is one of the most delicate aspects of SGM. Backtracking line search is not an option here, since it would require evaluating the full objective function at each iteration, destroying the computational advantage of this method.
Two popular step size strategies exist for SGM: a constant step size and a decreasing step size.
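As a sketch, the two schedules can be written as functions of the iteration counter $t$ that plug into the hypothetical `sgm` loop above; the $1/\sqrt{t+1}$ decay is just one common choice of decreasing schedule, not the only one:

```python
import numpy as np

def constant_step(gamma=0.1):
    # constant schedule: gamma_t = gamma for every iteration t
    return lambda t: gamma

def decreasing_step(gamma0=1.0):
    # decreasing schedule: gamma_t = gamma0 / sqrt(t + 1)
    return lambda t: gamma0 / np.sqrt(t + 1)
```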
A least squares problem can be written in the finite-sum form required by SGM, since \begin{equation} \frac{1}{n}\|\boldsymbol{A} \xx - \boldsymbol{b}\|^2 = \frac{1}{n} \sum_{i=1}^n (\boldsymbol{A}_i^T \xx - \boldsymbol{b}_i)^2 ~, \end{equation} where $\boldsymbol{A}_i$ is the $i$-th row of the matrix $\boldsymbol{A}$ and $\boldsymbol{b}_i$ the $i$-th entry of $\boldsymbol{b}$.
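Continuing the sketch above, the per-sample gradient is $\nabla f_i(\xx) = 2 \boldsymbol{A}_i (\boldsymbol{A}_i^T \xx - \boldsymbol{b}_i)$, which can be fed to the hypothetical `sgm` helper; the data below is synthetic and the step size is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))   # synthetic data matrix
b = rng.standard_normal(1000)         # synthetic targets

def grad_i(x, i):
    # gradient of f_i(x) = (A_i^T x - b_i)^2
    return 2.0 * A[i] * (A[i] @ x - b[i])

x_hat = sgm(grad_i, x0=np.zeros(20), step_size=decreasing_step(gamma0=0.01),
            n_samples=A.shape[0], max_iter=50_000)
```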
Stochastic Gradient with constant step-size
SGD is often augmented with a momentum term, so that the update in \eqref{eq:update_rule} becomes $$ \boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \gamma_t \nabla f_i(\xx_t) + \underbrace{\beta (\xx_t - \xx_{t-1})}_{\text{momentum}} $$
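A minimal sketch of this variant, assuming the same conventions as the earlier `sgm` example (the function name and the default $\beta$ are illustrative):

```python
import numpy as np

def sgm_momentum(grad_i, x0, step_size, n_samples, beta=0.9,
                 max_iter=1000, rng=None):
    """SGM with a momentum term beta * (x_t - x_{t-1})."""
    rng = np.random.default_rng() if rng is None else rng
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for t in range(max_iter):
        i = rng.integers(n_samples)
        # gradient step plus momentum term from the update above
        x_new = x - step_size(t) * grad_i(x, i) + beta * (x - x_prev)
        x_prev, x = x, x_new
    return x
```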
Momentum adds the previous update direction, scaled by $\beta$, to the gradient step. This has two beneficial effects: