Keep the gradient flowing

Notes on the Frank-Wolfe Algorithm, Part II: A Primal-dual Analysis

This blog post extends the convergence theory from the first part of these notes on the Frank-Wolfe (FW) algorithm with guarantees on the primal-dual gap, which generalize and strengthen the primal convergence guarantees obtained in that first part.

$$ \def\aa{\boldsymbol a} \def\balpha{\boldsymbol \alpha} \def\bmu{\boldsymbol \mu} \def\bb{\boldsymbol b} \def\cc{\boldsymbol c} \def\dd{\boldsymbol d} \def\xx{\boldsymbol x} \def\zz{\boldsymbol z} \def\vv{\boldsymbol v} \def\yy{\boldsymbol y} \def\ss{\boldsymbol s} \def\TT{\boldsymbol T} \def\Econd{\boldsymbol E} \def\CC{\boldsymbol C} \def\AA{\boldsymbol A} \def\RR{\mathbb R} \DeclareMathOperator*{\argmin}{{arg\,min}} \DeclareMathOperator*{\argmax}{{arg\,max}} \DeclareMathOperator*{\minimize}{{minimize}} \DeclareMathOperator*{\dom}{\mathbf{dom}} \DeclareMathOperator*{\diam}{\mathbf{diam}} \DeclareMathOperator*{\Fix}{\mathbf{Fix}} \DeclareMathOperator{\prox}{\mathbf{prox}} \def\defas{\stackrel{\text{def}}{=}} $$

Introduction

Although FW is one of the oldest methods in nonlinear constrained optimization, the development of primal-dual guarantees is relatively recent. In 2010, Clarkson gave primal-dual convergence rates in the particular case in which the domain is the simplex. These results were later extended by Jaggi to arbitrary domains.

These results, however, gave convergence guarantees for the minimum over all iterates, and not for the last one, as is customary for primal suboptimality guarantees. It was not until 2017, with the work of Nesterov, that primal-dual convergence rates on the last iterate were obtained. These primal-dual guarantees were recently extended by myself and coauthors to other FW variants such as Away-steps FW, Pairwise FW and FW with backtracking line search.

In this post I will only explore primal-dual convergence guarantees. However, duality theory also reveals other connections between FW and apparently disparate methods. For example, in the paper "Duality between subgradient and conditional gradient methods", Bach showed an equivalence between the FW algorithm and mirror descent on a dual problem.

Problem and Notation

As in the first part of these notes, I will be discussing the FW method applied to the following general optimization problem

\begin{equation}\label{eq:fw_objective} \argmin_{\xx \in \RR^d}\, \left\{\mathcal{P}(\xx) \defas f(\boldsymbol{x}) + \imath_{\mathcal{C}}(\xx)\right\}~, \end{equation}

where $f$ is differentiable with $L$-Lipschitz gradient and $\imath_{\mathcal{C}}$ is the indicator function over a convex and compact domain $\mathcal{C}$, that is, the function that returns $0$ if the argument belongs to $\mathcal{C}$ and $+\infty$ otherwise. \eqref{eq:fw_objective} is equivalent to the constrained problem $\argmin_{\xx \in \mathcal{C}} f(\xx)$ that I used in the first part of these notes, although for our purposes the formulation with the indicator function will be more convenient. In this post we will also assume that $f$ is convex.

From an initial guess $\xx_0$, the FW algorithm generates a sequence of iterates $\xx_1, \xx_2, \ldots$ that converge towards the solution to \eqref{eq:fw_objective}. Below is the pseudo-code for this algorithm, as presented in the first part of these notes:

\begin{align} &\textbf{Input}: \text{initial guess $\xx_0$, tolerance $\delta > 0$}\nonumber\\ & \textbf{For }t=0, 1, \ldots \textbf{ do } \\ &\quad \balpha_t = \nabla f(\boldsymbol{x}_t)\\ &\quad\boldsymbol{s}_t \in {\textstyle\argmax_{\boldsymbol{s} \in \mathcal{C}}} \langle - \balpha_t, \boldsymbol{s}\rangle\label{eq:lmo}\\ &\quad \boldsymbol{d}_t = \ss_t - \xx_t\\ &\quad g_t = -\langle \balpha_t, \dd_t \rangle\\ &\quad \textbf{If } g_t < \delta: \\ &\quad\qquad\hfill\text{// exit if gap is below tolerance }\nonumber\\ &\quad\qquad\textbf{return } \xx_t\\ &\quad {\textbf{Variant 1}}: \text{set step size as} \nonumber\\ &\quad\qquad\gamma_t=\vphantom{\sum_i}\min\Big\{\frac{g_t}{L\|\dd_t\|^2}, 1 \Big\}\label{eq:step_size}\\ &\quad \textbf{Variant 2}: \text{set step size by line search}\nonumber\\ &\quad\qquad\gamma_t=\argmin_{\gamma \in [0, 1]} f(\xx_t + \gamma \boldsymbol{d}_t)\label{eq:line_search}\\ &\quad\boldsymbol{x}_{t+1}=\boldsymbol{x}_t + \gamma_t \boldsymbol{d}_t~.\label{eq:update_rule}\\ &\textbf{end For loop}\\ & \textbf{return } \xx_t \end{align}
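
To make the pseudo-code concrete, here is a minimal Python sketch of Variant 1 (an illustration only, not an implementation used elsewhere in these notes). It assumes the user supplies a gradient oracle grad_f, a linear minimization oracle lmo implementing \eqref{eq:lmo}, and the Lipschitz constant L:

    import numpy as np

    def frank_wolfe(x0, grad_f, lmo, L, tol=1e-6, max_iter=1000):
        """Minimal sketch of the pseudo-code above, Variant 1 step size.

        grad_f(x) returns the gradient of f at x, lmo(v) returns an element
        of argmax_{s in C} <-v, s>, and L is the Lipschitz constant of grad_f.
        """
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(max_iter):
            alpha = grad_f(x)                  # alpha_t
            s = lmo(alpha)                     # linear minimization oracle
            d = s - x                          # direction d_t
            g = -alpha @ d                     # Frank-Wolfe gap g_t
            if g < tol:                        # exit if gap is below tolerance
                return x
            gamma = min(g / (L * d @ d), 1.0)  # Variant 1 step size
            x = x + gamma * d                  # update rule
        return x

For instance, for an $\ell_1$ ball of radius $r$ the oracle lmo can return $-r\,\operatorname{sign}(v_i)\,\boldsymbol{e}_i$ with $i = \argmax_j |v_j|$; this is the oracle used in the numerical check after Lemma 2 below.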

Duality and Frank-Wolfe

In convex optimization, every problem can be paired with another one, said to be dual to it, and this pairing reveals close connections between otherwise disparate properties. The FW algorithm has deep and unexpected links with duality, which we will now explore.

The dual problem is a concave maximization problem that we can associate with our original convex minimization problem. Recall the definition of convex conjugate:

Definition (convex conjugate). Let $f: \RR^d \to [-\infty, \infty]$ be an extended real-valued function. Its convex conjugate, denoted $f^*$, is the function $f^*: \RR^d \to [-\infty, \infty]$ defined by $$ f^{*}(\yy) \defas \sup_{\xx \in \RR^d} \left \{ \langle \yy , \xx \rangle - f \left( \xx \right) \right\}\,,\qquad \yy \in \RR^d\,. $$

With this definition, we have the following sequence of equivalences, starting from the original (primal) problem: \begin{align} &\min_{\xx \in \RR^d} \mathcal{P}(\xx) = \min_{\xx \in \RR^d} f(\xx) + \imath_{\mathcal{C}}(\xx)\\ &= \min_{\xx \in \RR^d} \max_{\balpha \in \RR^d}\Big\{\langle \xx, \balpha\rangle - f^*(\balpha)\Big\} + \imath_{\mathcal{C}}(\xx)\\ &= \max_{\balpha \in \RR^d}\min_{\xx \in \RR^d}\Big\{ \imath_{\mathcal{C}}(\xx) + \langle \xx,\balpha\rangle\Big\} - f^*(\balpha)\\ &= \max_{\balpha \in \RR^d}\underbrace{-\imath_{\mathcal{C}}^*(-\balpha) - f^*(\balpha)}_{\defas \mathcal{D}(\balpha)}~,\label{eq:pd_relationship} \end{align} where the second identity uses that the convex function $f$ coincides with its biconjugate, and the exchange of $\max$ and $\min$ in the following identity is justified by Sion's minimax theorem. Note also that we can use $\max$ instead of the usual $\sup$ in these expressions because of the smoothness of $f$, which implies strong convexity of $f^*$. We will call \eqref{eq:pd_relationship}, which is a concave maximization problem, the dual problem. By extension, we will refer to $\mathcal{D}$ as the dual objective or dual function.
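
For concreteness, here is a worked example of this dual (a toy instance of my own choosing, not needed for what follows). Take $f(\xx) = \frac{1}{2}\|\xx - \bb\|^2$ and $\mathcal{C} = \{\xx \in \RR^d : \|\xx\|_1 \leq r\}$. Then $f^*(\balpha) = \frac{1}{2}\|\balpha\|^2 + \langle \balpha, \bb\rangle$ and $\imath_{\mathcal{C}}^*(\yy) = r\|\yy\|_\infty$ (the support function of the $\ell_1$ ball), so the dual objective takes the closed form \begin{equation} \mathcal{D}(\balpha) = -r\|\balpha\|_\infty - \frac{1}{2}\|\balpha\|^2 - \langle \balpha, \bb\rangle~. \end{equation}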

The dual objective lower bounds the primal objective, and so the difference between primal and dual objectives gives a bound on the suboptimality. In other words, let $\xx^\star$ be any solution to the original primal problem \eqref{eq:fw_objective}. Then by definition of the dual objective we have \begin{equation} f(\xx) - f(\xx^\star) \leq f(\xx) - \mathcal{D}(\balpha)~, \end{equation} for any $\xx \in \mathcal{C}$ and any $\balpha \in \dom(\mathcal{D})$. This is quite remarkable, as it gives a meaningful bound on the distance to optimum without needing to know $\xx^\star$, which is of course unknown. This can then be used for instance as a stopping criterion in optimization algorithms. The quantity $f(\xx) - \mathcal{D}(\balpha)$ is often referred to as the primal-dual gap.
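
To spell out this weak-duality bound (a standard argument, included here for completeness): since each conjugate is defined as a supremum, evaluating the corresponding terms at $\xx^\star$, which is feasible and hence verifies $\imath_{\mathcal{C}}(\xx^\star) = 0$, gives for any $\balpha$ \begin{align} \mathcal{D}(\balpha) &= -\imath_{\mathcal{C}}^*(-\balpha) - f^*(\balpha)\\ &\leq -\big(\langle -\balpha, \xx^\star\rangle - \imath_{\mathcal{C}}(\xx^\star)\big) - \big(\langle \balpha, \xx^\star\rangle - f(\xx^\star)\big) = f(\xx^\star)~. \end{align}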

However, computing the primal-dual gap involves evaluating the dual objective, which in the general case can be as costly as solving the original problem. What is special in the case of FW is that the primal-dual gap is given as a byproduct of the method. This is quite unique among optimization methods.

The next lemma shows that the primal-dual gap (for a specific choice of primal and dual variables) is exactly equal to the FW gap $g_t$.

Lemma 2. The FW gap $g_t$ is a duality gap: \begin{equation} g_t = \mathcal{P}(\xx_t) - \mathcal{D}(\nabla f(\xx_t))~. \end{equation}

We have the following sequence of identities \begin{align} g_t &= \max_{\ss \in \mathcal{C}}\langle \nabla f(\xx_t), \xx_t - \ss\rangle\\ & = \langle \nabla f(\xx_t), \xx_t\rangle + {\max_{\ss \in \RR^d}}\Big\{ \langle- \nabla f(\xx_t), \ss\rangle - \imath_{\mathcal{C}}(\ss)\Big\}\\ & = \langle \nabla f(\xx_t), \xx_t\rangle + \imath^*_{\mathcal{C}}(-\nabla f(\xx_t))\\ & = f(\xx_t) + \underbrace{\imath^*_{\mathcal{C}}(-\nabla f(\xx_t)) + f^*(\nabla f(\xx_t))}_{=-\mathcal{D}(\nabla f(\xx_t))}~, \end{align} where the first identity uses the definition of $\ss_t$, the second and third ones the definition of convex conjugate, and the last one is a consequence of the Fenchel-Young identity (see for example Proposition 16.10 of Bauschke and Combettes).
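
Lemma 2 is also easy to verify numerically. The snippet below is a sketch on the toy $\ell_1$-ball instance introduced earlier (the instance and all names in the code are my own choices, not part of the lemma); it checks that the FW gap and the primal-dual gap $\mathcal{P}(\xx) - \mathcal{D}(\nabla f(\xx))$ coincide at an arbitrary feasible point:

    import numpy as np

    # Toy instance (my own choice): f(x) = 0.5*||x - b||^2 over the
    # L1 ball of radius r, where all conjugates have closed forms.
    rng = np.random.default_rng(0)
    d, r = 5, 2.0
    b = rng.standard_normal(d)
    x = rng.standard_normal(d)
    x *= r / (2 * np.abs(x).sum())        # a feasible point, ||x||_1 = r/2

    f = lambda z: 0.5 * np.sum((z - b) ** 2)
    grad_f = lambda z: z - b
    f_conj = lambda a: 0.5 * np.sum(a ** 2) + a @ b   # f*(alpha)
    support_C = lambda y: r * np.max(np.abs(y))       # conjugate of the indicator
    D = lambda a: -support_C(-a) - f_conj(a)          # dual objective

    # Frank-Wolfe gap g = max_{s in C} <grad f(x), x - s>, with the max
    # attained at a signed vertex of the L1 ball.
    grad = grad_f(x)
    i = np.argmax(np.abs(grad))
    s = np.zeros(d)
    s[i] = -r * np.sign(grad[i])
    fw_gap = grad @ (x - s)

    # Both printed numbers coincide, as predicted by Lemma 2.
    print(fw_gap, f(x) - D(grad))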

Duality Gap Convergence Rate

The next theorem gives an $\mathcal{O}(1/t)$ convergence rate for FW on the duality gap and is the main result in this post. A similar but more general theorem, which supports inexact oracles and other step sizes, is given in Theorem 3 of this paper. Note that since any dual gap $\mathcal{P}(\xx_t) - \mathcal{D}(\balpha)$ is an upper bound on the primal suboptimality $f(\xx_t) - f(\xx^\star)$, this result is strictly stronger than, and implies, the convergence rates we obtained in Theorem 2 of the first part of these notes.

Theorem 3. Let $\bmu_t$ be defined recursively as $\bmu_0 = \nabla f(\xx_0)$, $\bmu_{t+1} = (1 - \xi_t)\bmu_t + \xi_t\nabla f(\xx_t)$, with $\xi_t = 2 / (t + 2)$. Then we have: \begin{equation} \mathcal{P}(\xx_t) - \mathcal{D}(\bmu_t) \leq \frac{2L\diam(\mathcal{C})^2}{t + 1} = \mathcal{O}\left(\frac{1}{t}\right)~. \end{equation}

By the key recursive inequality from the first part of these notes, we have the following bound, valid for all $\xi_t \in [0, 1]$: \begin{align} \mathcal{P}(\xx_{t+1}) &\leq \mathcal{P}(\xx_t) - \xi_t g_t + \frac{\xi_t^2 L}{2}\|\ss_t - \xx_t\|^2\\ &= (1 - \xi_t)\mathcal{P}(\xx_t) + \xi_t \mathcal{D}(\balpha_t) + \frac{\xi_t^2 L}{2}\|\ss_t - \xx_t\|^2~, \end{align} where the last identity follows by Lemma 2.

For the dual objective, since $\mathcal{D}$ is concave and $\bmu_{t+1}$ is a convex combination of $\bmu_t$ and $\balpha_t$, Jensen's inequality gives $\mathcal{D}(\bmu_{t+1}) \geq (1 - \xi_t)\mathcal{D}(\bmu_t) + \xi_t \mathcal{D}(\balpha_t)$. Subtracting this from the previous inequality we have \begin{equation} \mathcal{P}(\xx_{t+1}) - \mathcal{D}(\bmu_{t+1}) \leq (1 - \xi_t)(\mathcal{P}(\xx_t) - \mathcal{D}(\bmu_t)) + \frac{\xi_t^2 L}{2}\|\ss_t - \xx_t\|^2~. \end{equation}

Now let $\xi_t = 2 / (t+2)$ and $e_t \defas \frac{t(t+1)}{2}(f(\xx_t) - \mathcal{D}(\bmu_t))$. Multiplying the previous inequality by $(t+1)(t+2)/2$ then gives \begin{align} e_{t+1} &\leq e_t + \frac{(t+1)}{(t+2)}L\|\ss_t - \xx_t\|^2\\ &\leq e_t + L\diam(\mathcal{C})^2~. \end{align} Summing this last inequality from $0$ to $t-1$ and using that $e_0 = 0$ gives \begin{equation}\label{eq:sublinear_bound_sigma} e_{t} \leq e_0 + t L \diam(\mathcal{C})^2 \implies f(\xx_t) - \mathcal{D}(\bmu_t) \leq \frac{2 L \diam(\mathcal{C})^2}{t+1}~. \end{equation}
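
The construction of the averaged dual variable $\bmu_t$ is straightforward to simulate. Continuing the toy $\ell_1$-ball instance from the snippet after Lemma 2 (so that $L = 1$ and $\diam(\mathcal{C}) = 2r$), the following sketch runs FW with the Variant 1 step size, maintains $\bmu_t$ with $\xi_t = 2/(t+2)$, and prints the primal-dual gap $\mathcal{P}(\xx_t) - \mathcal{D}(\bmu_t)$ next to the bound of Theorem 3:

    # Continues the toy instance above: grad f is 1-Lipschitz, diam(C) = 2*r.
    x = np.zeros(d)                        # x_0, feasible
    mu = grad_f(x)                         # mu_0 = grad f(x_0)
    L_f, diam = 1.0, 2 * r
    for t in range(30):
        xi = 2.0 / (t + 2)
        grad = grad_f(x)
        i = np.argmax(np.abs(grad))
        s = np.zeros(d)
        s[i] = -r * np.sign(grad[i])       # LMO over the L1 ball
        dvec = s - x
        g = grad @ (x - s)                 # FW gap g_t
        if g < 1e-12:
            break
        mu = (1 - xi) * mu + xi * grad     # averaged dual variable of Theorem 3
        gamma = min(g / (L_f * dvec @ dvec), 1.0)   # Variant 1 step size
        x = x + gamma * dvec
        bound = 2 * L_f * diam ** 2 / (t + 2)       # Theorem 3 bound for x_{t+1}
        print(t + 1, f(x) - D(mu), bound)           # gap stays below the bound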

Citing

If this post has been useful to you, please consider citing its associated paper that contains the above analysis and much more, including backtracking line-search and other Frank-Wolfe variants:

    @inproceedings{pedregosa2020linearly,
      title={Linearly Convergent Frank-Wolfe with Backtracking Line-Search},
      author={Pedregosa, Fabian and Negiar, Geoffrey and Askari, Armin and Jaggi, Martin},
      booktitle={Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics},
      year={2020},
      url={https://arxiv.org/pdf/1806.05123.pdf}
    }
  


Thanks. A heartfelt thanks to Nicolas Le Roux, Vlad Niculae and Konstantin Mishchenko for reporting typos on this post.