# Notes on the Frank-Wolfe Algorithm, Part II: A Primal-dual Analysis

Category: optimization

#optimization #Frank-Wolfe

This blog post extends the convergence theory from the first part of my notes on the Frank-Wolfe (FW) algorithm with guarantees on the primal-dual gap, which generalize and strengthen the convergence guarantees obtained in the first part.

Although FW is one of the oldest methods in nonlinear constrained optimization,

These results, however, gave convergence guarantees for the minimum over all iterates, and not for the last one, as is usual in the case of primal suboptimality. It was not until 2017, with the work of Nesterov,

In this post I will only explore primal-dual convergence guarantees. However, duality theory also reveals other connections between FW and other apparently disparate methods. For example, in the paper "Duality between subgradient and conditional gradient methods",

## Problem and notation

As in the first part of these notes, I will be discussing the FW method applied to the following general optimization problem

\begin{equation}\label{eq:fw_objective} \minimize_{\xx \in \mathcal{D}} f(\boldsymbol{x})~, \end{equation}

where $f$ is differentiable with $L$-Lipschitz gradient and the domain $\mathcal{D}$ is a convex and compact subset of $\RR^d$. In this post, and unlike in the first part, we will also assume that $f$ is convex.

From an initial guess $\xx_0$, the FW algorithm generates a sequence of iterates $\xx_1, \xx_2, \ldots$ that converge towards the solution to \eqref{eq:fw_objective}. Below is the pseudo-code for this algorithm, as presented in the first part of these notes:

\begin{align} &\textbf{Input}: \text{initial guess $\xx_0$, tolerance $\delta > 0$}\nonumber\\ & \textbf{For }t=0, 1, \ldots \textbf{ do } \\ &\quad\boldsymbol{s}_t = \argmax_{\boldsymbol{s} \in \mathcal{D}} \langle -\nabla f(\boldsymbol{x}_t), \boldsymbol{s}\rangle\label{eq:lmo}\\ &\quad \boldsymbol{d}_t = \ss_t - \xx_t\\ &\quad g_t = -\langle \nabla f(\xx_t), \dd_t \rangle\\ &\quad \textbf{If } g_t < \delta: \\ &\quad\qquad\hfill\text{// exit if gap is below tolerance }\nonumber\\ &\quad\qquad\textbf{return } \xx_t\\ &\quad {\textbf{Variant 1}}: \text{set step size as} \nonumber\\ &\quad\qquad\gamma_t = \vphantom{\sum_i}\min\Big\{\frac{g_t}{L\|\dd_t\|^2}, 1 \Big\}\label{eq:step_size}\\ &\quad \textbf{Variant 2}: \text{set step size by line search}\nonumber\\ &\quad\qquad\gamma_t = \argmin_{\gamma \in [0, 1]} f(\xx_t + \gamma \boldsymbol{d}_t)\label{eq:line_search}\\ &\quad\boldsymbol{x}_{t+1} = \boldsymbol{x}_t + \gamma_t \boldsymbol{d}_t~.\label{eq:update_rule}\\ &\textbf{end For loop}\\ & \textbf{return } \xx_t \end{align}
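To make the pseudo-code concrete, here is a minimal Python sketch of the method with the Variant 1 step size \eqref{eq:step_size}, on a least-squares objective over an $\ell_1$ ball. The problem instance and all names are illustrative (not from these notes); this domain is chosen because the linear minimization oracle \eqref{eq:lmo} has a closed form: it returns a signed coordinate vertex.

```python
import numpy as np

def fw_l1_ball(A, b, r=1.0, tol=1e-8, max_iter=10_000):
    """Frank-Wolfe (Variant 1 step size) for min_x 0.5*||Ax - b||^2
    over the l1 ball of radius r."""
    L = np.linalg.norm(A.T @ A, 2)       # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])             # x_0 = 0 is feasible
    for _ in range(max_iter):
        grad = A.T @ (A @ x - b)
        # LMO: a minimizer of <grad, s> over the l1 ball is a signed vertex
        i = np.argmax(np.abs(grad))
        s = np.zeros_like(x)
        s[i] = -r * np.sign(grad[i])
        d = s - x
        gap = -grad @ d                  # FW gap g_t
        if gap < tol:                    # exit if gap is below tolerance
            break
        gamma = min(gap / (L * (d @ d)), 1.0)   # Variant 1 step size
        x = x + gamma * d
    return x, gap

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
x, gap = fw_l1_ball(A, b)
```

The loop exits once the gap $g_t$ falls below the tolerance, mirroring the stopping rule in the pseudo-code above.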

## Duality and Frank-Wolfe

In convex optimization, every problem can be paired with another one, said to be *dual* to it. This pairing reveals close connections between otherwise disparate problems and methods. The FW algorithm has deep and somewhat unexpected links with duality, which we will now explore.

The *dual problem* is a concave maximization problem that we can associate with our original convex minimization problem. To state it we need two ingredients: indicator functions and convex conjugates. Let $\imath_{\mathcal{D}}$ be the indicator function over $\mathcal{D}$, that is, the function that returns $0$ if its argument belongs to $\mathcal{D}$ and $+\infty$ otherwise.

**Definition (convex conjugate).** Let $f: \RR^d \to [-\infty, \infty]$ be an extended real-valued function. Its convex conjugate, denoted $f^*$, is the function $f^*: \RR^d \to [-\infty, \infty]$ defined by
$$
f^{{*}}(\yy) \defas \sup_{\xx \in \RR^d} \left \{ \langle \yy , \xx \rangle - f \left( \xx \right) \right\}\,,\qquad \yy \in \RR^d\,.
$$

With this notation, we can associate with \eqref{eq:fw_objective} the following concave maximization problem, which we will call the *dual problem*:
\begin{equation}
\max_{\uu \in \RR^d}\, \underbrace{-f^*(\uu) - \imath^*_{\mathcal{D}}(-\uu)}_{\defas\, \psi(\uu)}~.
\end{equation}
By extension, we will refer to $\psi$ as the *dual objective* or dual function.
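For example, for $f(\xx) = \frac{1}{2}\|\xx\|^2$ the supremum in this definition is attained at $\xx = \yy$, which gives
\begin{equation}
f^*(\yy) = \langle \yy, \yy \rangle - \tfrac{1}{2}\|\yy\|^2 = \tfrac{1}{2}\|\yy\|^2~,
\end{equation}
so this particular $f$ is its own conjugate. Another conjugate that will appear below is that of the indicator: $\imath^*_{\mathcal{D}}(\yy) = \sup_{\ss \in \mathcal{D}} \langle \yy, \ss \rangle$, the support function of $\mathcal{D}$.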

The dual objective lower bounds the primal objective and so the difference between primal and dual objectives gives a bound on the suboptimality. In other words, let $\xx^\star$ be any solution to the original primal problem \eqref{eq:fw_objective}. Then by definition of dual objective we have
\begin{equation}
f(\xx) - f(\xx^\star) \leq f(\xx) - \psi(\uu)~,
\end{equation}
for any $\xx \in \mathcal{D}$ and any $\uu \in \dom(\psi)$.
This is remarkable, as it gives a meaningful bound on the distance to optimum without needing to know $\xx^\star$, which is of course unknown. This can then be used, for instance, as a stopping criterion in optimization algorithms. The quantity $f(\xx) - \psi(\uu)$ is often referred to as the *primal-dual gap*.

However, computing the primal-dual gap involves evaluating the dual objective, which in the general case can be as costly as solving the original problem. What is special in the case of FW is that the primal-dual gap is given as a byproduct of the method. This is quite unique among optimization methods.
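To illustrate, consider a made-up instance where the solution is known in closed form: minimizing $\frac{1}{2}\|\xx - \boldsymbol{c}\|^2$ over the box $[-1, 1]^d$, whose solution is the componentwise clipping of $\boldsymbol{c}$. The quantity $g_t$ that the algorithm computes at each iteration upper-bounds the suboptimality $f(\xx_t) - f(\xx^\star)$ throughout (all names below are illustrative):

```python
import numpy as np

# Minimize f(x) = 0.5*||x - c||^2 over the box D = [-1, 1]^d.
# The solution is known in closed form: x* = clip(c, -1, 1).
rng = np.random.default_rng(0)
d = 5
c = 2.0 * rng.standard_normal(d)
x_star = np.clip(c, -1.0, 1.0)

f = lambda x: 0.5 * np.sum((x - c) ** 2)
L = 1.0                                  # the gradient x - c is 1-Lipschitz

x = np.zeros(d)
certified = []                           # does g_t upper-bound f(x_t) - f(x*)?
for t in range(200):
    grad = x - c
    s = -np.sign(grad)                   # LMO over the box: a vertex of [-1, 1]^d
    dvec = s - x
    gap = -grad @ dvec                   # FW gap g_t
    certified.append(f(x) - f(x_star) <= gap + 1e-12)
    if gap < 1e-10:
        break
    gamma = min(gap / (L * (dvec @ dvec)), 1.0)   # Variant 1 step size
    x = x + gamma * dvec
```

Every entry of `certified` comes out true: the gap is a suboptimality certificate available for free.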

The next lemma shows that the primal-dual gap (for a specific choice of primal and dual variables) is exactly equal to the FW gap $g_t$.

**Lemma 2**. The FW gap $g_t$ is a duality gap at $\xx = \xx_t, \uu = \nabla f(\xx_t)$. More precisely, we have
\begin{equation}
g_t = f(\xx_t) - \psi(\nabla f(\xx_t))~.
\end{equation}

We have the following sequence of identities
\begin{align}
g_t &= \max_{\ss \in \mathcal{D}}\langle \nabla f(\xx_t), \xx_t - \ss\rangle\\
& = \langle \nabla f(\xx_t), \xx_t\rangle + {\max_{\ss \in \RR^d}}\Big\{ \langle- \nabla f(\xx_t), \ss\rangle - \imath_{\mathcal{D}}(\ss)\Big\}\\
& = \langle \nabla f(\xx_t), \xx_t\rangle + \imath^*_{\mathcal{D}}(-\nabla f(\xx_t))\\
& = f(\xx_t) + \underbrace{\imath^*_{\mathcal{D}}(-\nabla f(\xx_t)) + f^*(\nabla f(\xx_t))}_{-\psi(\nabla f(\xx_t))}~,
\end{align}
where the first identity uses the definitions of $g_t$ and $\ss_t$, the second one the definition of convex conjugate, and the last one is a consequence of the Fenchel-Young identity (see for example Proposition 16.10 of Bauschke and Combettes).
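The identity of Lemma 2 can also be checked numerically on a toy instance. Take $f(\xx) = \frac{1}{2}\|\xx - \boldsymbol{c}\|^2$, whose conjugate is $f^*(\uu) = \frac{1}{2}\|\uu\|^2 + \langle \uu, \boldsymbol{c}\rangle$, and $\mathcal{D} = [-1, 1]^d$, for which $\imath^*_{\mathcal{D}}$ is the $\ell_1$ norm; then $\psi$ has a closed form (the instance is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
c = 2.0 * rng.standard_normal(d)

# f(x) = 0.5*||x - c||^2 with conjugate f*(u) = 0.5*||u||^2 + <u, c>.
f = lambda x: 0.5 * np.sum((x - c) ** 2)
f_conj = lambda u: 0.5 * np.sum(u ** 2) + u @ c
# D = [-1, 1]^d: the conjugate of its indicator is the support function,
# which here is the l1 norm.
support_D = lambda y: np.abs(y).sum()
psi = lambda u: -f_conj(u) - support_D(-u)   # dual objective

x = rng.uniform(-1.0, 1.0, size=d)           # any feasible point
grad = x - c

# FW gap at x: g = max_{s in D} <grad, x - s>, attained at s = -sign(grad)
s = -np.sign(grad)
g = grad @ (x - s)

# Lemma 2: the FW gap equals the primal-dual gap at (x, grad f(x))
lemma_holds = np.isclose(g, f(x) - psi(grad))
```

Both sides agree to machine precision, as the lemma predicts.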

## Primal-dual Analysis

The next theorem gives an $\mathcal{O}(1/t)$ convergence rate for FW on the primal-dual gap and is the main result in this post.

**Theorem 3** . Let $\uu_t$ be defined recursively as $\uu_0 = \nabla f(\xx_0)$, $\uu_{t+1} = (1 - \xi_t)\uu_{t} + \xi_t\nabla f(\xx_t)$, with $\xi_t = 2 / (t + 2)$. Then we have:
\begin{equation}
f(\xx_t) - \psi(\uu_t) \leq \frac{2L\diam(\mathcal{D})^2}{t + 1} = \mathcal{O}\left(\frac{1}{t}\right)~.
\end{equation}

By the key recursive inequality we have the following inequality, valid for all $\xi_t \in [0, 1]$: \begin{align} f(\xx_{t+1}) &\leq f(\xx_t) + \xi_t\langle \nabla f(\xx_t), \dd_t\rangle + \frac{\xi_t^2 L}{2}\|\ss_t - \xx_t\|^2\\ &= (1 - \xi_t)f(\xx_t) + \xi_t \psi(\nabla f(\xx_t)) + \frac{\xi_t^2 L}{2}\|\ss_t - \xx_t\|^2~, \end{align} where the last identity follows from Lemma 2, since $\langle \nabla f(\xx_t), \dd_t\rangle = -g_t = \psi(\nabla f(\xx_t)) - f(\xx_t)$.

We now introduce the auxiliary variable $\sigma_t$, defined recursively as $\sigma_0 = \psi(\nabla f(\xx_0))$, $\sigma_{t+1} = (1 - \xi_t) \sigma_t + \xi_t \psi(\nabla f(\xx_t))$. Subtracting $\sigma_{t+1}$ from both sides of the previous inequality gives \begin{equation} f(\xx_{t+1}) - \sigma_{t+1} \leq (1 - \xi_t)\left[f(\xx_t) - \sigma_t\right] + \frac{\xi_t^2 L}{2}\|\ss_t - \xx_t\|^2~. \end{equation}

Let $\xi_t = 2 / (t+2)$ and $e_t \defas \frac{t(t+1)}{2}(f(\xx_t) - \sigma_t)$. Multiplying the previous inequality by $(t+1)(t+2)/2$ then gives \begin{align} e_{t+1} &\leq e_t + \frac{(t+1)}{(t+2)}L\|\ss_t - \xx_t\|^2\\ &\leq e_t + L\diam(\mathcal{D})^2~. \end{align} Summing this last inequality from $0$ to $t-1$ gives \begin{equation}\label{eq:sublinear_bound_sigma} e_{t} \leq t L \diam(\mathcal{D})^2 \implies f(\xx_t) - \sigma_t \leq \frac{2 L \diam(\mathcal{D})^2}{t+1}~. \end{equation}

In order to prove the claimed bound we now only need to show that $- \psi(\uu_t) \leq - \sigma_t$. We will do this by induction. For $t=0$ we have $\psi(\uu_0) = \sigma_0$ by definition, and so the bound is trivially verified. Suppose it is true for $t$; then for $t+1$ we have \begin{align} - \psi(\uu_{t+1}) &= - \psi((1 - \xi_t)\uu_t + \xi_t \nabla f(\xx_t)) \\ &\leq - (1 - \xi_t)\psi(\uu_t) - \xi_t \psi(\nabla f(\xx_t))\\ &\leq - (1 - \xi_t)\sigma_t - \xi_t \psi(\nabla f(\xx_t))\\ &= - \sigma_{t+1}~, \end{align} where the first inequality holds by convexity of $-\psi$ and the second one by the induction hypothesis. Using this bound in \eqref{eq:sublinear_bound_sigma} yields the desired bound \begin{equation} f(\xx_{t}) - \psi(\uu_{t}) \leq \frac{2L\diam(\mathcal{D})^2}{t+1}~. \end{equation}
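As a numerical sanity check of Theorem 3, consider a made-up instance where $\psi$ has a closed form: $f(\xx) = \frac{1}{2}\|\xx - \boldsymbol{c}\|^2$ over $\mathcal{D} = [-1, 1]^d$, so that $L = 1$, $\diam(\mathcal{D})^2 = 4d$, and $\psi(\uu) = -f^*(\uu) - \imath^*_{\mathcal{D}}(-\uu) = -\frac{1}{2}\|\uu\|^2 - \langle \uu, \boldsymbol{c}\rangle - \|\uu\|_1$. Running FW with step size $\xi_t = 2/(t+2)$ while maintaining the averaged dual variable $\uu_t$, the bound holds at every iteration (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
c = 2.0 * rng.standard_normal(d)

f = lambda x: 0.5 * np.sum((x - c) ** 2)
# Closed-form dual objective for this instance:
# psi(u) = -f*(u) - support_D(-u) = -0.5*||u||^2 - <u, c> - ||u||_1
psi = lambda u: -(0.5 * np.sum(u ** 2) + u @ c) - np.abs(u).sum()

L = 1.0                                  # f has 1-Lipschitz gradient
diam_sq = 4.0 * d                        # squared diameter of [-1, 1]^d

x = np.zeros(d)
u = x - c                                # u_0 = grad f(x_0)
bound_holds = True
for t in range(500):
    # Theorem 3: primal-dual gap bound at iteration t
    bound_holds &= f(x) - psi(u) <= 2 * L * diam_sq / (t + 1) + 1e-9
    grad = x - c
    s = -np.sign(grad)                   # LMO over the box
    xi = 2.0 / (t + 2.0)
    u = (1 - xi) * u + xi * grad         # averaged dual variable u_{t+1}
    x = x + xi * (s - x)                 # FW step with xi_t = 2/(t+2)
```

Note that the primal iterate remains feasible by construction, since each update is a convex combination of $\xx_t$ and a vertex $\ss_t \in \mathcal{D}$.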

### References

A heartfelt thanks to Nicolas Le Roux and Vlad Niculae for reporting typos on this post.