Notes on the Frank-Wolfe Algorithm, Part II: A Primal-dual Analysis
This blog post extends the convergence theory from the first part of these notes on the Frank-Wolfe (FW) algorithm with guarantees on the primal-dual gap, which generalize and strengthen the primal convergence guarantees obtained there.
Introduction
Although FW is one of the oldest methods in nonlinear constrained optimization, primal-dual convergence guarantees for it were established only relatively recently.
Early results, however, gave convergence guarantees for the minimum of the gap over all iterates, and not for
the last one, as is usual in the case of primal
suboptimality. It was not until 2017, with the work of Nesterov, that primal-dual guarantees on the last iterate were obtained.
In this post I will only explore primal-dual convergence guarantees. However, duality theory
also reveals other connections between FW and apparently disparate methods. For example,
in the paper "Duality between subgradient and conditional gradient methods", Francis Bach shows an equivalence between the subgradient method and the conditional gradient method when applied to an appropriately defined pair of primal and dual problems.
Problem and Notation
As in the first part of these notes, I will be discussing the FW method applied to the following general optimization problem
\begin{equation}\label{eq:fw_objective} \argmin_{\xx \in \RR^d}\, \left\{\mathcal{P}(\xx) \defas f(\xx) + \imath_{\mathcal{C}}(\xx)\right\}~, \end{equation}
where $f$ is convex and differentiable with an $L$-Lipschitz gradient and $\imath_{\mathcal{C}}$ is the
indicator function over a convex and compact domain $\mathcal{C}$, that is, the function that
returns $0$ if the argument belongs to $\mathcal{C}$ and $+\infty$ otherwise.
From an initial guess $\xx_0$, the FW algorithm generates a sequence of iterates $\xx_1, \xx_2, \ldots$ that converges towards a solution of \eqref{eq:fw_objective}. Below is the pseudo-code for this algorithm, as presented in the first part of these notes:
\begin{align}
&\textbf{Input}: \text{initial guess $\xx_0$, tolerance $\delta > 0$}\nonumber\\
&\textbf{For }t=0, 1, \ldots \textbf{ do } \\
&\quad \balpha_t = \nabla f(\xx_t)\\
&\quad \ss_t \in {\textstyle\argmax_{\ss \in \mathcal{C}}} \langle -\balpha_t, \ss\rangle\label{eq:lmo}\\
&\quad \dd_t = \ss_t - \xx_t\\
&\quad g_t = -\langle \balpha_t, \dd_t \rangle\\
&\quad \textbf{If } g_t < \delta: \\
&\quad\qquad\text{// exit if gap is below tolerance}\nonumber\\
&\quad\qquad\textbf{return } \xx_t\\
&\quad \textbf{Variant 1}: \text{set step size as} \nonumber\\
&\quad\qquad\gamma_t=\min\Big\{\frac{g_t}{L\|\dd_t\|^2}, 1 \Big\}\label{eq:step_size}\\
&\quad \textbf{Variant 2}: \text{set step size by line search}\nonumber\\
&\quad\qquad\gamma_t=\argmin_{\gamma \in [0, 1]} f(\xx_t + \gamma \dd_t)\label{eq:line_search}\\
&\quad \xx_{t+1}=\xx_t + \gamma_t \dd_t~.\label{eq:update_rule}\\
&\textbf{end For loop}\\
&\textbf{return } \xx_t
\end{align}
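To make the pseudo-code above concrete, here is a small NumPy sketch of Variant 1 on a toy problem of my own choosing (not part of the original algorithm description): minimizing $f(\xx) = \tfrac{1}{2}\|A\xx - b\|^2$ over an $\ell_1$ ball of radius $r$. The helper names `lmo_l1` and `frank_wolfe` are mine.

```python
# Minimal sketch of the FW algorithm (Variant 1 step size), assuming
# f(x) = 0.5 * ||A x - b||^2 and C = {x : ||x||_1 <= r}.
import numpy as np

def lmo_l1(grad, r=1.0):
    """Linear minimization oracle for the l1 ball: argmax_{||s||_1 <= r} <-grad, s>."""
    s = np.zeros_like(grad)
    i = np.argmax(np.abs(grad))
    s[i] = -r * np.sign(grad[i])
    return s

def frank_wolfe(A, b, r=1.0, tol=1e-6, max_iter=1000):
    x = np.zeros(A.shape[1])           # initial guess x_0, feasible for the l1 ball
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of grad f (squared spectral norm)
    for t in range(max_iter):
        grad = A.T @ (A @ x - b)       # alpha_t = grad f(x_t)
        s = lmo_l1(grad, r)            # s_t, solution of the linear subproblem
        d = s - x                      # d_t
        g = -grad @ d                  # Frank-Wolfe gap g_t
        if g < tol:                    # exit if gap is below tolerance
            break
        gamma = min(g / (L * (d @ d)), 1.0)   # Variant 1 step size
        x = x + gamma * d
    return x, g                        # g is the last computed FW gap

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
    x, gap = frank_wolfe(A, b, r=1.0)
    print(gap)  # certified suboptimality bound at the returned iterate, see below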
Duality and Frank-Wolfe
In convex optimization, every problem can be paired with another problem, said to be dual to it. This pairing reveals close connections between otherwise seemingly unrelated quantities and methods. The FW algorithm has deep and somewhat unexpected links with duality, which we will now explore.
The dual problem is a concave maximization problem that we can associate with our
original convex minimization problem. Writing $h^*(\balpha) \defas \sup_{\xx \in \RR^d}\{\langle \balpha, \xx\rangle - h(\xx)\}$ for the convex conjugate of a function $h$, the dual problem associated with \eqref{eq:fw_objective} is \begin{equation} \max_{\balpha \in \RR^d}\, \left\{\mathcal{D}(\balpha) \defas - f^*(\balpha) - \imath^*_{\mathcal{C}}(-\balpha)\right\}~. \end{equation} Then by definition of convex conjugate we have $f^*(\balpha) \geq \langle \balpha, \xx\rangle - f(\xx)$ and $\imath^*_{\mathcal{C}}(-\balpha) \geq \langle -\balpha, \xx\rangle - \imath_{\mathcal{C}}(\xx)$ for every $\xx$, and adding these two inequalities gives $\mathcal{D}(\balpha) \leq \mathcal{P}(\xx)$ (weak duality).
The dual objective thus lower bounds the primal objective, and so the difference between primal and dual objectives gives a bound on the suboptimality. In other words, let $\xx^\star$ be any solution to the original primal problem \eqref{eq:fw_objective}. Then, since $\mathcal{D}(\balpha) \leq \mathcal{P}(\xx^\star) = f(\xx^\star)$, we have \begin{equation} f(\xx) - f(\xx^\star) \leq f(\xx) - \mathcal{D}(\balpha)~, \end{equation} for any $\xx \in \mathcal{C}$ and any $\balpha \in \RR^d$. This is quite remarkable, as it gives a computable bound on the suboptimality without needing to know $\xx^\star$, which is of course unknown. It can be used, for instance, as a stopping criterion in optimization algorithms. The quantity $f(\xx) - \mathcal{D}(\balpha)$ is often referred to as the primal-dual gap.
However, computing the primal-dual gap involves evaluating the dual objective, which in the general case can be as costly as solving the original problem. What is special in the case of FW is that the primal-dual gap is given as a byproduct of the method. This is quite unique among optimization methods.
The next lemma shows that the primal-dual gap (for a specific choice of primal and dual variables) is exactly equal to the FW gap $g_t$.
Lemma 2. The FW gap $g_t$ is a duality gap: \begin{equation} g_t = \mathcal{P}(\xx_t) - \mathcal{D}(\nabla f(\xx_t))~. \end{equation}
We have the following sequence of identities
\begin{align}
g_t &= \max_{\ss \in \mathcal{C}}\langle \nabla f(\xx_t), \xx_t - \ss\rangle\\
& = \langle \nabla f(\xx_t), \xx_t\rangle + {\max_{\ss \in \RR^d}}\Big\{ \langle-
\nabla f(\xx_t), \ss\rangle - \imath_{\mathcal{C}}(\ss)\Big\}\\
& = \langle \nabla f(\xx_t), \xx_t\rangle + \imath^*_{\mathcal{C}}(-\nabla f(\xx_t))\\
& = f(\xx_t) + \underbrace{\imath^*_{\mathcal{C}}(-\nabla f(\xx_t)) +
f^*(\nabla f(\xx_t))}_{=-\mathcal{D}(\nabla f(\xx_t))}~,
\end{align}
where the first identity uses the definition of $\ss_t$ and $g_t$, the second one rewrites the constrained maximum using the indicator function, the third one uses the definition of
convex conjugate, and the last one is a consequence of the Fenchel-Young identity $\langle \nabla f(\xx_t), \xx_t\rangle = f(\xx_t) + f^*(\nabla f(\xx_t))$ (see for
example Proposition 16.10 of Bauschke and Combettes), together with the fact that $\mathcal{P}(\xx_t) = f(\xx_t)$ since $\xx_t \in \mathcal{C}$.
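To see Lemma 2 in action, here is a quick numerical check on a toy problem of my own choosing (not from the analysis above): $f(\xx) = \tfrac{1}{2}\|\xx - b\|^2$ over an $\ell_1$ ball of radius $r$, for which both conjugates have closed forms, $f^*(\balpha) = \tfrac{1}{2}\|\balpha\|^2 + \langle \balpha, b\rangle$ and $\imath^*_{\mathcal{C}}(\balpha) = r\|\balpha\|_\infty$.

```python
# Numerical check of Lemma 2 on a toy problem (my own choice):
# f(x) = 0.5 * ||x - b||^2, C = l1 ball of radius r, so that
# f*(alpha) = 0.5 * ||alpha||^2 + <alpha, b> and i_C*(u) = r * ||u||_inf.
import numpy as np

rng = np.random.default_rng(0)
b = rng.standard_normal(5)
r = 2.0
x = rng.uniform(-r / 5, r / 5, size=5)     # a feasible point: ||x||_1 <= r

grad = x - b                               # alpha_t = grad f(x_t)
# Frank-Wolfe gap: g = max_{s in C} <grad, x - s> = <grad, x> + r * ||grad||_inf
g = grad @ x + r * np.max(np.abs(grad))

# Dual objective D(alpha) = -f*(alpha) - i_C*(-alpha), evaluated at alpha = grad f(x)
D = -(0.5 * grad @ grad + grad @ b) - r * np.max(np.abs(grad))

primal = 0.5 * (x - b) @ (x - b)           # P(x) = f(x) since x is feasible
print(np.isclose(g, primal - D))           # prints True: g = P(x) - D(grad f(x))
```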
Duality Gap Convergence Rate
The next theorem gives an $\mathcal{O}(1/t)$ convergence rate for FW on the duality gap and
is the main result in this post.
Theorem 3. Let $\xx_t$ be the iterates generated by the FW algorithm (with either step-size variant) and let $\bmu_t$ be defined recursively as $\bmu_0 = \nabla f(\xx_0)$, $\bmu_{t+1} = (1 - \xi_t)\bmu_t + \xi_t\nabla f(\xx_t)$, with $\xi_t = 2 / (t + 2)$. Then for all $t \geq 1$ we have: \begin{equation} \mathcal{P}(\xx_t) - \mathcal{D}(\bmu_t) \leq \frac{2L\diam(\mathcal{C})^2}{t + 1} = \mathcal{O}\left(\frac{1}{t}\right)~. \end{equation}
By the key recursive inequality from the first part of these notes, we have the following bound, valid for all $\xi_t \in [0, 1]$: \begin{align} \mathcal{P}(\xx_{t+1}) &\leq \mathcal{P}(\xx_t) - \xi_t g_t + \frac{\xi_t^2 L}{2}\|\ss_t - \xx_t\|^2\\ &= (1 - \xi_t)\mathcal{P}(\xx_t) + \xi_t \mathcal{D}(\balpha_t) + \frac{\xi_t^2 L}{2}\|\ss_t - \xx_t\|^2~, \end{align} where the last identity follows from Lemma 2, since $g_t = \mathcal{P}(\xx_t) - \mathcal{D}(\balpha_t)$.
For the dual objective, since $\mathcal{D}$ is concave, Jensen's inequality gives $\mathcal{D}(\bmu_{t+1}) \geq (1 - \xi_t)\mathcal{D}(\bmu_t) + \xi_t \mathcal{D}(\balpha_t)$. Subtracting this from the previous inequality we obtain \begin{equation} \mathcal{P}(\xx_{t+1}) - \mathcal{D}(\bmu_{t+1}) \leq (1 - \xi_t)(\mathcal{P}(\xx_t) - \mathcal{D}(\bmu_t)) + \frac{\xi_t^2 L}{2}\|\ss_t - \xx_t\|^2~. \end{equation}
Let $\xi_t = 2 / (t+2)$ and $e_t \defas \frac{t(t+1)}{2}(\mathcal{P}(\xx_t) - \mathcal{D}(\bmu_t))$. Multiplying the previous inequality by $(t+1)(t+2)/2$ then gives \begin{align} e_{t+1} &\leq e_t + \frac{(t+1)}{(t+2)}L\|\ss_t - \xx_t\|^2\\ &\leq e_t + L\diam(\mathcal{C})^2~. \end{align} Summing this last inequality from $0$ to $t-1$ and using $e_0 = 0$ gives \begin{equation}\label{eq:sublinear_bound_sigma} e_{t} \leq t L \diam(\mathcal{C})^2 \implies \mathcal{P}(\xx_t) - \mathcal{D}(\bmu_t) \leq \frac{2 L \diam(\mathcal{C})^2}{t+1}\quad\text{for } t \geq 1~, \end{equation} where the implication follows by dividing both sides by $t(t+1)/2$. This is the claimed bound.
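As an illustration of the theorem, here is a sketch (again on a toy problem of my own choosing, reusing the closed-form conjugates from the previous snippet) that tracks the certified gap $\mathcal{P}(\xx_t) - \mathcal{D}(\bmu_t)$ along the run. For simplicity it uses the classical step size $\gamma_t = \xi_t = 2/(t+2)$, for which the same quadratic upper bound, and hence the same argument, applies.

```python
# Sketch (my own toy example): track the certified gap P(x_t) - D(mu_t) of
# Theorem 3 for f(x) = 0.5 * ||x - b||^2 over the l1 ball of radius r, where
# D(alpha) = -f*(alpha) - i_C*(-alpha) has a closed form.
import numpy as np

def dual(alpha, b, r):
    # D(alpha) = -(0.5 ||alpha||^2 + <alpha, b>) - r ||alpha||_inf
    return -(0.5 * alpha @ alpha + alpha @ b) - r * np.max(np.abs(alpha))

rng = np.random.default_rng(0)
b, r, L = rng.standard_normal(50), 1.0, 1.0   # grad f(x) = x - b is 1-Lipschitz
x = np.zeros_like(b)                          # x_0 = 0 is feasible
mu = x - b                                    # mu_0 = grad f(x_0)

for t in range(200):
    grad = x - b                              # grad f(x_t)
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(x)
    s[i] = -r * np.sign(grad[i])              # LMO over the l1 ball
    xi = 2.0 / (t + 2)
    mu = (1 - xi) * mu + xi * grad            # mu_{t+1}: gradient averaging
    x = x + xi * (s - x)                      # x_{t+1}, step size gamma_t = xi_t
    gap = 0.5 * (x - b) @ (x - b) - dual(mu, b, r)
    # Theorem 3 bound at iteration t+1; diam(C) = 2r in the Euclidean norm
    assert gap <= 2 * L * (2 * r) ** 2 / (t + 2) + 1e-10
```

The assertion never fires, which matches the $\mathcal{O}(1/t)$ bound on the last iterate proved above.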
Citing
If this post has been useful to you, please consider citing its associated paper that contains the above analysis and much more, including backtracking line-search and other Frank-Wolfe variants:
@inproceedings{pedregosa2020linearly,
  title={Linearly Convergent Frank-Wolfe with Backtracking Line-Search},
  author={Pedregosa, Fabian and Negiar, Geoffrey and Askari, Armin and Jaggi, Martin},
  booktitle={Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics},
  year={2020},
  url={https://arxiv.org/pdf/1806.05123.pdf}
}