$$ \DeclareMathOperator*{\argmin}{{arg\,min}} \DeclareMathOperator*{\argmax}{{arg\,max}} \DeclareMathOperator*{\minimize}{{minimize}} $$

Newton’s method

Newton's method is an iterative method for solving the following unconstrained optimization problem

\begin{equation}\label{eq:fw_objective} \minimize_{\boldsymbol{x} \in \RR^p} f(\boldsymbol{x}) ~, \end{equation}

when $f$ is twice differentiable.

In Newton's method, we approximate the objective with a quadratic surrogate of the form \begin{equation} Q_t(\xx) = c_t + \boldsymbol{g}_t^T(\xx - \xx_t) + \frac{1}{2 \gamma}(\xx - \xx_t)^T \boldsymbol{H}_t (\xx - \xx_t)~. \end{equation} Compared to gradient descent, the quadratic term is not fixed to be the identity but instead can incorporate an arbitrary (positive semi-definite) matrix $\boldsymbol{H}_t$.

A reasonable condition to impose on this surrogate function is that at $\xx_t$ it coincides with $f$ in its value and its first two derivatives, that is: \begin{align} Q_t(\xx_t) &= f(\xx_t) \implies c_t = f(\xx_t)\\ \nabla Q_t(\xx_t) &= \nabla f(\xx_t) \implies \boldsymbol{g}_t = \nabla f(\xx_t)\\ \nabla^2 Q_t(\xx_t) &= \nabla^2 f(\xx_t) \implies \boldsymbol{H}_t = \nabla^2 f(\xx_t)~, \end{align} where $\nabla^2 f(\xx_t)$ is the Hessian of $f$: \begin{equation} \nabla^2 f(\xx_t)= \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\,\partial x_p} \\[2.2ex] \dfrac{\partial^2 f}{\partial x_2\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2\,\partial x_p} \\[2.2ex] \vdots & \vdots & \ddots & \vdots \\[2.2ex] \dfrac{\partial^2 f}{\partial x_p\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_p\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_p^2} \end{bmatrix}~, \end{equation} where in this last equation the subscripts refer to the entries of the vector and not to the iterates. Alternatively, for $\gamma = 1$ you might also recognize $Q_t$ as the second-order Taylor expansion of $f$ around $\xx_t$. In all, we have the following surrogate function: \begin{equation} Q_t(\xx) = f(\xx_t) + \nabla f(\xx_t)^T (\xx - \xx_t) + \frac{1}{2 \gamma}(\xx - \xx_t)^T \nabla^2 f(\xx_t) (\xx - \xx_t)~. \end{equation}
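To make this concrete, here is a small numerical sketch of the surrogate $Q_t$. The test function below, its hand-computed gradient and Hessian, and the evaluation points are hypothetical choices made only for illustration; any twice-differentiable $f$ would do.

```python
import numpy as np

# Hypothetical twice-differentiable test function, used only for illustration.
def f(x):
    return np.exp(x[0]) + np.exp(x[1]) + 0.5 * (x[0] + x[1]) ** 2

def grad_f(x):
    s = x[0] + x[1]
    return np.array([np.exp(x[0]) + s, np.exp(x[1]) + s])

def hess_f(x):
    return np.array([[np.exp(x[0]) + 1.0, 1.0],
                     [1.0, np.exp(x[1]) + 1.0]])

def Q(x, x_t, gamma=1.0):
    """Quadratic surrogate of f around x_t (its second-order Taylor expansion when gamma=1)."""
    d = x - x_t
    return f(x_t) + grad_f(x_t) @ d + 0.5 / gamma * d @ hess_f(x_t) @ d

x_t = np.array([0.5, -0.3])
print(Q(x_t, x_t) - f(x_t))            # 0: the surrogate matches f at x_t
x_near = x_t + np.array([1e-2, -2e-2])
print(Q(x_near, x_t) - f(x_near))      # small: good local approximation near x_t
```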

We can find the minimizer of this surrogate by setting its gradient to zero. This way we find (assuming the Hessian is invertible) \begin{equation} \xx_{t+1} = \argmin_{\xx} Q_t(\xx) = \xx_t - \gamma [\nabla^2 f(\xx_t)]^{-1} \nabla f(\xx_t)~. \end{equation}
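This update rule translates into a short iteration. The sketch below reuses `f`, `grad_f` and `hess_f` from the previous snippet; the stopping rule, the iteration cap and the default $\gamma = 1$ are illustrative choices, and the Newton direction is obtained by solving a linear system rather than forming the inverse Hessian (more on this below).

```python
def newton_method(x0, grad_f, hess_f, gamma=1.0, tol=1e-10, max_iter=100):
    """Newton's method: x_{t+1} = x_t - gamma * [hess f(x_t)]^{-1} grad f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:      # stop when the gradient is (almost) zero
            break
        # Newton direction: solve hess * delta = -grad instead of inverting the Hessian.
        delta = np.linalg.solve(hess_f(x), -g)
        x = x + gamma * delta
    return x

x_star = newton_method(np.zeros(2), grad_f, hess_f)
print(x_star, np.linalg.norm(grad_f(x_star)))   # gradient norm is close to zero
```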

Where applicable, Newton's method converges much faster towards a local maximum or minimum than gradient descent. In fact, every local minimum has a neighborhood such that, if we start within this neighborhood, Newton's method with step size $\gamma=1$ converges quadratically, assuming the Hessian is invertible and Lipschitz continuous.
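A quick way to observe this quadratic rate numerically is to track the distance to the solution along the iterations: the error is roughly squared at every step, so the number of correct digits roughly doubles. This again reuses the hypothetical example and the `newton_method` sketch above.

```python
# Reference solution computed to high accuracy with the sketch above.
x_star = newton_method(np.zeros(2), grad_f, hess_f, tol=1e-14)

x = np.zeros(2)
for t in range(6):
    print(t, np.linalg.norm(x - x_star))              # error roughly squares at each step
    x = x - np.linalg.solve(hess_f(x), grad_f(x))     # full Newton step (gamma = 1)
```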

The biggest drawback of Newton's method is that computing the inverse of the Hessian in high dimensions can be an expensive operation. In such cases, instead of directly inverting the Hessian it is better to compute the vector $\boldsymbol\Delta$ as the solution to the system of linear equations $$\nabla^2 f(\xx_t) \boldsymbol{\Delta} = -\nabla f(\xx_t)~, $$ and then set $\xx_{t+1} = \xx_t + \gamma \boldsymbol{\Delta}$. This system can be solved exactly by various matrix factorizations or approximately (but to high accuracy) using iterative methods.
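As an illustration of this last point, here are two standard ways to compute the Newton step $\boldsymbol\Delta$ without forming the inverse, again on the hypothetical example above: an exact solve through a Cholesky factorization (valid when the Hessian is positive definite) and an approximate solve with the conjugate gradient method.

```python
from scipy.linalg import cho_factor, cho_solve
from scipy.sparse.linalg import cg

x_t = np.zeros(2)
H, g = hess_f(x_t), grad_f(x_t)

# Exact solve via a Cholesky factorization of the (positive definite) Hessian.
delta_chol = cho_solve(cho_factor(H), -g)

# Approximate solve with conjugate gradient; it only needs Hessian-vector
# products, which makes it attractive in high dimensions.
delta_cg, info = cg(H, -g)

print(delta_chol, delta_cg)   # both approximate -[hess f(x_t)]^{-1} grad f(x_t)
```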

Examples

[Figure: Newton's method with step size $\gamma = 1$. The quadratic approximation is exact and we converge in a single iteration.]

[Figure: Newton's method with step size $\gamma = 1$. When the Hessian is close to singular there might be some numerical instabilities.]

Next: Quasi-Newton

Previous: Gradient Descent