The first method that we will describe predates the modern field of optimization, and was originally proposed as a method to solve systems of equations

Gradient Descent is an iterative method to solve the following unconstrained optimization problem

\begin{equation}\label{eq:fw_objective} \minimize_{\boldsymbol{x} \in \RR^p} f(\boldsymbol{x}) ~, \end{equation}

where $f$ is a differentiable function.

**Key idea**: approximate objective with a quadratic surrogate of the form
\begin{equation}
Q_t(\xx) = c_t + \boldsymbol{g}_t^T(\xx - \xx_t) + \frac{1}{2 \gamma}(\xx - \xx_t)^T (\xx - \xx_t)~.
\end{equation}

A reasonable condition to impose on this surrogate function is that it at $\xx_t$ it coincides with $f$ in its value and first derivative, that is: \begin{align} Q_t(\xx_t) &= f(\xx_t) \implies c_t = f(\xx_t)\\ \nabla Q_t(\xx_t) &= \nabla f(\xx_t) \implies \boldsymbol{g}_t = \nabla f(\xx_t) \end{align} Alternatively, you might also see $Q_t$ as the first-order Taylor expansion of $f$ at $\xx_t$. In all, we have the following surrogate function: \begin{equation} Q_t(\xx_t) = f(\xx_t) + \nabla f(\xx_t)^T(\xx - \xx_t) + \frac{1}{2 \gamma}(\xx - \xx_t)^T (\xx - \xx_t)~. \end{equation}

We can find the optimum of the function deriving and equating to zero. This way we find \begin{equation} \xx_{t+1} = \argmin_{\xx} Q_t(\xx) = \xx_t - \gamma \nabla f(\xx_t) \end{equation}

We can write the full gradient descent algorithm as follows. The algorithm only has one free parameter: the step size $\gamma$.

\begin{align} &\textbf{Input}: \text{initial guess $\xx_0$, step size $\gamma > 0$}\nonumber\\ & \textbf{For }t=0, 1, \ldots \textbf{ do } \\ &\quad\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \gamma \nabla f(\xx_t)~.\label{eq:update_rule}\\ &\textbf{end For loop}\\ & \textbf{return } \xx_t \end{align}

We examine the convergence of gradient descent on three examples: a well-conditioned quadratic, an ill-conditioned quadratic and a non-convex function.

As we have seen in the previous examples, the convergence of gradient is extremely sensible to the choice of step size. The second and third example highlight the necessity of this step size to be *adaptive*: in regions of large variability of the gradient we would like to take small step sizes, while on regions with small variability we would like to take large step sizes.

*Backtracking line search* procedures allow to have select a step size depending on the current iterate and gradient. In this procedure, an initial (optimistic) step size $\gamma_t$ is selected and the following inequality (also known as sufficient decrease condition) evaluated
$$
f(\xx_t - \gamma_t \nabla f(\xx_t)) \leq f(\xx_t) - \frac{\gamma_t}{2}\|\nabla f(\xx_t)\|^2~.
$$
If this inequality is verified, the current step size is kept. If not, the step size is divided by 2 (or any number > 1) and the sufficient decrease condition evaluated until it is verified.

The Gradient Descent algorithm with backtracking line search then becomes

\begin{align} &\textbf{Input}: \text{initial guess $\xx_0$, step size $\gamma > 0$}\nonumber\\ & \textbf{For }t=0, 1, \ldots \textbf{ do } \\ &\quad \textbf{Initial step size estimate }\gamma_t\\ & \quad\textbf{While}(true) \textbf{ do }\\ &\quad\quad \textbf{If } f(\xx_t - \gamma_t \nabla f(\xx_t)) \leq f(\xx_t) - \frac{\gamma_t}{2}\|\nabla f(\xx_t)\|^2\\ &\quad\quad\quad \textbf{break}\\ &\quad\quad\textbf{Else }: \gamma_t = \gamma_t / 2\\ &\quad\textbf{End While}\\ &\quad\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \gamma_t \nabla f(\xx_t)\\ &\textbf{end For loop}\\ & \textbf{return } \xx_t \end{align}

The following examples show the convergence of gradient descent with the aforementioned backtracking line search strategy for the step size.