Optimization Nuggets: Implicit Bias of Gradient-based Methods

Losses with Unique Finite Root.

⊕
By Fabian Pedregosa.

Category: learning theory
#implicit bias #proofs #nuggets

Mon 10 January 2022

When an optimization problem has multiple global minima, different algorithms can find different solutions, a phenomenon often referred to as the implicit bias of optimization algorithms. In this post we'll characterize the implicit bias of gradient-based methods on a class of regression problems that includes linear least squares and Huber loss regression.

Consider the optimization problem where the objective function $f$ is a generalized linear model with data matrix $A \in \RR^{n \times p}$ and target vector $b \in \RR^n$. Denoting by $A_1, \ldots, A_n$ the row vectors of the data matrix, we can write this problem as \begin{equation}\label{eq:opt} \argmin_{x \in \RR^p} \left\{f(x) \defas \sum_{i=1}^n \varphi (A_i^\top x, b_i) \right\}\,, \end{equation} where $\varphi(z, y)$ is a differentiable real-valued function verifying the unique finite root condition, which is that it is only minimized for $z=y$. These losses are usually used for regression and includes the least squares $\varphi(z, y) = (z - y)^2$ and Huber losses. We'll further assume that the linear system $A x = b$ is under-specified, that is, that it admits more than one solution. This assumption is verified for example –but not limited to– over-parametrized models ($p > n$) with a full rank data matrix.

Problems of this form verify two key properties that make it easy to characterize the bias of gradient-based methods. By gradient-based methods I mean any method in which the updates are given by a linear combination of current and past gradients. This includes gradient descent, gradient descent with momentum, stochastic gradient descent (SGD), SGD with momentum, Nesterov's accelerated gradient method. It does not include however quasi-Newton methods or diagonally preconditioned methods such as Adagrad or Adam.

Property 1: Iterates remain in the span of the data. The gradient of the $i$-th sample $\varphi (A_i^\top x, b_i)$ has the same direction as its data sample $A_i$. If we denote by $\varphi'$ the derivative of $\varphi$ with respect to its first argument, then we have \begin{equation} \nabla_x \left[\varphi (A_i^\top x, b_i)\right] = \underbrace{\vphantom{\varphi_i'A_i}A_i}_{\text{vector}} \underbrace{\varphi'(A_i^\top x, b_i)}_{\text{scalar}} \,. \end{equation} This implies that any gradient-based method generates updates that stay in the span of the vectors $\{A_1, \ldots, A_n\}$.

It's no surprise then that the vector space generated by the samples $A_1, \ldots, A_n$ plays a crucial role here. For convenience we'll denote this subspace by \begin{equation} \spanA \defas \textcolor{colorspan}{\span\{A_1, \ldots, A_n\}} \,. \end{equation} Another subspace that will play a crucial role is the orthogonal complement of $\spanA$, which we denote $\spanAC$. We'll denote in red (blue) both the vector space $\spanA$ (its orthogonal complement $\spanAC$) and its elements.

Property 2: minimizers solve the linear system $A x = b$. By the unique root condition of $\varphi$, the global minimizer of \eqref{eq:opt} is achieved when $A_i^\top x = b_i$ for all $i$. In other words, the global minimizers are the solutions to the linear system $A x = b$, a set that is non-empty by the under-specification assumption.

With these ingredients we can characterize the implicit bias in the following Theorem:

Gradient-based methods started from $x_0$ converge to the solution with smallest distance to $x_0$. More precisely, assume that the iterates of a gradient-based method converge to a solution of \eqref{eq:opt}, and let $x_\infty \defas \lim_{t \to \infty} x_{t} $ denote this limit. Then $x_{\infty}$ can be expressed as \begin{equation} x_{\infty} = \textcolor{colorspan}{A^\dagger b} + \textcolor{colorcomplement}{(I - P)x_0}\,, \end{equation} where $^\dagger$ denotes the matrix pseudoinverse and $\textcolor{colorcomplement}{(I - P)}$ is the orthogonal projection onto the orthogonal complement of $\textcolor{colorspan}{\mathcal{A}}$. Furthermore, this is the solution with smallest distance to $x_0$, that is, \begin{equation}\label{eq:minimal_traveling} x_{\infty} = \argmin_{x \in \RR^p} ||x - x_0||_2 ~\text{ subject to } A x = b \,. \end{equation}

I've split the proof into two parts. The first part characterizes the limit iterate $x_{\infty}$, while the second part shows that the limit iterate solves a minimal distance problem.

Characterizing the limit iterate. The main argument here is to show that the limit iterate belongs to the intersection of two affine spaces and then compute their intersection. By Property 2, the limit iterate must solve the linear system $A x_{\infty} = b$. A classical linear algebra result states that all solutions of this problem are of the form $x + {\textcolor{colorcomplement}c}$, with $x$ any solution of $A x = b$ and $\textcolor{colorcomplement}{c \in \spanAC}$. Let's take $x = \textcolor{colorspan}{A^\dagger b}$ – where $^\dagger$ denotes the matrix pseudoinverse – so that we have \begin{equation} x_{\infty} = \textcolor{colorspan}{A^\dagger b} + {\textcolor{colorcomplement}c} \,,\quad \text{ for some }\textcolor{colorcomplement}{c \in \spanAC}\,. \end{equation} Let $P$ denote the orthogonal projection onto $\spanA$. Then we can decompose the initialization as $x_0 = \textcolor{colorspan}{P x_0} + \textcolor{colorcomplement}{(I - P)x_0}$. By the first property all updates are in $\spanA$, so the limit iterate can be written as \begin{equation} x_{\infty} = \textcolor{colorcomplement}{(I - P)x_0} + \textcolor{colorspan}{a} \,, \text{ for some }\textcolor{colorspan}{a \in \spanA}\,. \end{equation} Combining the previous two equations, we have that ${\textcolor{colorcomplement}c} = \textcolor{colorcomplement}{(I - P)x_0}$ and $\textcolor{colorspan}{a} = \textcolor{colorspan}{A^\dagger b}$. Hence we have arrived at the characterization \begin{equation}\label{eq:limit_characterization} x_{\infty} = \textcolor{colorspan}{A^\dagger b} + \textcolor{colorcomplement}{(I - P)x_0}\,. \end{equation}

The limit iterate has minimal distance. Let $x_{\star}$ denote the solution to \eqref{eq:minimal_traveling}, which is unique by the strong convexity of this problem. We aim to show that $x_{\star} = x_{\infty}$. Since $x_\star$ is also a solution to the linear system $A x_\star = b$, we must have $x_\star - x_{\infty} \in \spanAC$ and so \begin{align} \|x_{\star} - x_0\| &= \|\textcolor{colorcomplement}{x_{\star} - x_{\infty}} + x_{\infty}- x_0\|\\ &= \|\textcolor{colorcomplement}{x_{\star} - x_{\infty}} + \textcolor{colorspan}{A^\dagger b} - \textcolor{colorspan}{P x_0} \|\\ &= \sqrt{ \|\textcolor{colorcomplement}{x_{\star} - x_{\infty}} \|^2 + \| \textcolor{colorspan}{A^\dagger b} - \textcolor{colorspan}{P x_0}\|^2 }\,, \end{align} where the last identity follows by orthogonality. Since $x_\star$ minimizes the distance $\|x_{\star} - x_0\|$ , we must have $\textcolor{colorcomplement}{x_{\star} - x_{\infty}} = 0$, and so $x_\infty = x_\star$.

An immediate consequence of this Theorem is that when the initialization $x_0$ is in $\spanA$, then its projection onto $\spanAC$ is zero, and so from Eq. \eqref{eq:limit_characterization} we have $\textcolor{colorspan}{x_{\infty}} = \textcolor{colorspan}{A^\dagger b}$. This corresponds to the minimal norm solution, so we have

Under the same conditions as the theorem, if $x_0 \in \spanA$, then the limit iterate $x_{\infty}$ solves the minimal norm problem \begin{equation}\label{eq:minnorm} x_{\infty} = \argmin_{x \in \RR^p} ||x||_2 ~\text{ subject to } A x = b \,. \end{equation}

About the proof

I haven't been able to track down the origins of this proof. The first time I saw this argument was in Wilson's 2017 paper, although they don't claim of novelty of this result. A similar argument is also sketched in section 2.1 of Gunasekar et al. 2018.

Update 2022/02/17. Thanks to everyone that commented on twitter we could track down the origins of this result!. It seems one of the earliest references is this 1978 paper by Herman et al. who derives a similar result for the Kaczmarz method.

Acknowledgements. Thanks to Vlad Niculae, Baptiste Goujaud, Paul Vicol and Lechao Xiao for feedback on this post and reporting typos, and everyone that commented on the twitter thread with insights and related references. Check out also Mathurin Massias' beautiful take on this using duality.

Keep the gradient flowing

Optimization Nuggets: Implicit Bias of Gradient-based Methods

About the proof

References