$$ \DeclareMathOperator*{\argmin}{{arg\,min}} \DeclareMathOperator*{\argmax}{{arg\,max}} \DeclareMathOperator*{\minimize}{{minimize}} \newcommand{\xx}{\boldsymbol{x}} \newcommand{\RR}{\mathbb{R}} $$

Gradient Descent

The first method that we will describe predates the modern field of optimization, and was originally proposed as a method to solve systems of equations.

Augustin-Louis Cauchy was a French mathematician and physicist who made pioneering contributions to mathematical analysis. Motivated by the need to solve "large" quadratic problems (6 variables) that arise in astronomy, he invented the method of gradient descent. Today, this method is used to comfortably solve problems with thousands of variables.

Motivation

Gradient Descent is an iterative method to solve the following unconstrained optimization problem

\begin{equation}\label{eq:fw_objective} \minimize_{\boldsymbol{x} \in \RR^p} f(\boldsymbol{x}) ~, \end{equation}

where $f$ is a differentiable function.

Key idea: approximate the objective with a quadratic surrogate of the form \begin{equation} Q_t(\xx) = c_t + \boldsymbol{g}_t^T(\xx - \xx_t) + \frac{1}{2 \gamma}(\xx - \xx_t)^T (\xx - \xx_t)~. \end{equation}

A reasonable condition to impose on this surrogate function is that at $\xx_t$ it coincides with $f$ in its value and first derivative, that is: \begin{align} Q_t(\xx_t) &= f(\xx_t) \implies c_t = f(\xx_t)\\ \nabla Q_t(\xx_t) &= \nabla f(\xx_t) \implies \boldsymbol{g}_t = \nabla f(\xx_t) \end{align} Alternatively, you might also see $Q_t$ as the first-order Taylor expansion of $f$ at $\xx_t$, plus a quadratic term that penalizes moving far from $\xx_t$. In all, we have the following surrogate function: \begin{equation} Q_t(\xx) = f(\xx_t) + \nabla f(\xx_t)^T(\xx - \xx_t) + \frac{1}{2 \gamma}(\xx - \xx_t)^T (\xx - \xx_t)~. \end{equation}

We can find the minimum of this surrogate by differentiating with respect to $\xx$ and setting the gradient to zero: \begin{equation} \nabla Q_t(\xx) = \nabla f(\xx_t) + \frac{1}{\gamma}(\xx - \xx_t) = 0~. \end{equation} Solving for $\xx$ gives the update \begin{equation} \xx_{t+1} = \argmin_{\xx} Q_t(\xx) = \xx_t - \gamma \nabla f(\xx_t)~. \end{equation}

The Gradient Descent Algorithm

We can write the full gradient descent algorithm as follows. The algorithm only has one free parameter: the step size $\gamma$.

\begin{align} &\textbf{Input}: \text{initial guess $\xx_0$, step size $\gamma > 0$}\nonumber\\ & \textbf{For }t=0, 1, \ldots \textbf{ do } \\ &\quad\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \gamma \nabla f(\xx_t)~.\label{eq:update_rule}\\ &\textbf{end For loop}\\ & \textbf{return } \xx_t \end{align}
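As a concrete illustration, here is a minimal NumPy sketch of this update rule. The function name gradient_descent, the max_iter cap and the gradient-norm stopping criterion are our own additions, since the pseudocode above iterates indefinitely:

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size, max_iter=1000, tol=1e-6):
    """Gradient descent with a fixed step size (gamma = step_size).

    Iterates x_{t+1} = x_t - step_size * grad_f(x_t) until the gradient
    norm falls below tol or max_iter iterations are reached.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        grad = grad_f(x)
        if np.linalg.norm(grad) < tol:
            break  # (approximately) at a stationary point
        x = x - step_size * grad
    return x

# Example: minimize f(x) = ||x||^2 / 2, whose gradient is x itself.
x_star = gradient_descent(grad_f=lambda x: x, x0=[3.0, -2.0], step_size=0.5)
print(x_star)  # close to the optimum [0, 0]
```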

Examples

We examine the convergence of gradient descent on three examples: a well-conditioned quadratic, an ill-conditioned quadratic and a non-convex function.

Step size γ = 0.2. On a well-conditioned quadratic function, gradient descent converges to the optimum in a few iterations.

Step size γ = 0.02. On a badly-conditioned quadratic function, gradient descent takes many more iterations to converge than on the above well-conditioned problem. This is because gradient descent requires a much smaller step size on this problem in order to converge.

Step size γ = 0.02. Gradient descent also converges on a badly-conditioned non-convex problem, although convergence is slow in this case.
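The effect of conditioning can also be checked numerically. The sketch below counts iterations on two quadratics of the form $f(\xx) = \frac{1}{2}\xx^T A \xx$ that differ only in their condition number; the matrices and tolerance are illustrative choices of our own, not the exact problems from the figures:

```python
import numpy as np

def iterations_to_converge(A, step_size, x0, tol=1e-6, max_iter=100_000):
    """Run gradient descent on f(x) = 0.5 * x^T A x (gradient: A x)
    and return the number of iterations until ||grad|| < tol."""
    x = np.asarray(x0, dtype=float)
    for t in range(max_iter):
        grad = A @ x
        if np.linalg.norm(grad) < tol:
            return t
        x = x - step_size * grad
    return max_iter

x0 = [1.0, 1.0]
well_conditioned = np.diag([1.0, 0.8])   # condition number 1.25
ill_conditioned = np.diag([1.0, 0.01])   # condition number 100

print(iterations_to_converge(well_conditioned, step_size=0.2, x0=x0))  # tens of iterations
print(iterations_to_converge(ill_conditioned, step_size=0.2, x0=x0))   # thousands of iterations
```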

On the choice of step size and line search

As we have seen in the previous examples, the convergence of gradient descent is extremely sensitive to the choice of step size. The second and third examples highlight the need for the step size to be adaptive: in regions of large variability of the gradient we would like to take small step sizes, while in regions with small variability we would like to take large step sizes.

Backtracking line search procedures allow us to select a step size depending on the current iterate and gradient. In this procedure, an initial (optimistic) step size $\gamma_t$ is selected and the following inequality (also known as the sufficient decrease condition) is evaluated: $$ f(\xx_t - \gamma_t \nabla f(\xx_t)) \leq f(\xx_t) - \frac{\gamma_t}{2}\|\nabla f(\xx_t)\|^2~. $$ If this inequality is verified, the current step size is kept. If not, the step size is divided by 2 (or any constant greater than 1) and the sufficient decrease condition is evaluated again, until it is verified.

The Gradient Descent algorithm with backtracking line search then becomes

\begin{align} &\textbf{Input}: \text{initial guess $\xx_0$}\nonumber\\ & \textbf{For }t=0, 1, \ldots \textbf{ do } \\ &\quad \textbf{Choose an initial step size estimate }\gamma_t > 0\\ & \quad\textbf{While } \text{true} \textbf{ do }\\ &\quad\quad \textbf{If } f(\xx_t - \gamma_t \nabla f(\xx_t)) \leq f(\xx_t) - \frac{\gamma_t}{2}\|\nabla f(\xx_t)\|^2\\ &\quad\quad\quad \textbf{break}\\ &\quad\quad\textbf{Else }: \gamma_t = \gamma_t / 2\\ &\quad\textbf{End While}\\ &\quad\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \gamma_t \nabla f(\xx_t)\\ &\textbf{end For loop}\\ & \textbf{return } \xx_t \end{align}
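A possible NumPy translation of this pseudocode is sketched below. As before, the function name, the stopping criterion, and the choice to restart each iteration from the same optimistic estimate init_step are assumptions on our part:

```python
import numpy as np

def gradient_descent_bls(f, grad_f, x0, init_step=1.0, max_iter=1000, tol=1e-6):
    """Gradient descent with backtracking line search.

    Each iteration starts from the optimistic estimate init_step and
    halves it until the sufficient decrease condition
        f(x - step * grad) <= f(x) - (step / 2) * ||grad||^2
    is verified.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        grad = grad_f(x)
        sq_grad_norm = grad @ grad
        if np.sqrt(sq_grad_norm) < tol:
            break
        step = init_step
        while f(x - step * grad) > f(x) - 0.5 * step * sq_grad_norm:
            step /= 2.0  # backtrack: current step size is too optimistic
        x = x - step * grad
    return x

# Example: an ill-conditioned quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 0.01])
x_star = gradient_descent_bls(f=lambda x: 0.5 * x @ A @ x,
                              grad_f=lambda x: A @ x, x0=[1.0, 1.0])
```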

The following examples show the convergence of gradient descent with the aforementioned backtracking line search strategy for the step size.

On a well-conditioned quadratic function, gradient descent converges to the optimum in a few iterations. Adding the backtracking line search strategy for the step size does not change much in this case.
In this example we can clearly see the effect of the backtracking line search strategy: once the algorithm reaches a region of low curvature, it can take larger step sizes. The final result is a much improved convergence compared to the fixed step-size equivalent.
The backtracking line search also improves convergence on non-convex problems.

Next: Newton's method

Previous: Introduction