fa.bianp.net

Notes on the Frank-Wolfe Algorithm, Part II: A Primal-dual Analysis

Category: optimization
#optimization #Frank-Wolfe

This blog post extends the convergence theory from the first part of my notes on the Frank-Wolfe (FW) algorithm with convergence guarantees on the primal-dual gap which generalize and strengthen the convergence guarantees obtained in the first part.

$$ \def\aa{\boldsymbol a} \def\balpha{\boldsymbol \alpha} \def\bb{\boldsymbol b} \def\cc{\boldsymbol c} \def\dd{\boldsymbol d} \def\xx{\boldsymbol x} \def\zz{\boldsymbol z} \def\uu{\boldsymbol u} \def\vv{\boldsymbol v} \def\yy{\boldsymbol y} \def\ss{\boldsymbol s} \def\TT{\boldsymbol T} \def\Econd{\boldsymbol E} \def\CC{\boldsymbol C} \def\AA{\boldsymbol A} \def\RR{\mathbb R} \DeclareMathOperator*{\argmin}{{arg\,min}} \DeclareMathOperator*{\argmax}{{arg\,max}} \DeclareMathOperator*{\minimize}{{minimize}} \DeclareMathOperator*{\dom}{\mathbf{dom}} \DeclareMathOperator*{\diam}{\mathbf{diam}} \DeclareMathOperator*{\Fix}{\mathbf{Fix}} \DeclareMathOperator{\prox}{\mathbf{prox}} \def\defas{\stackrel{\text{def}}{=}} $$

Although FW is one of the oldest methods in nonlinear constrained optimization, the development of primal-dual guarantees is relatively recent. In 2010, Clarkson gave primal-dual convergence rates in the particular case in which the domain is the simplex. These results were later extended by Jaggi to arbitrary domains.

These results however gave convergence guarantees for the minimum over all iterates, and not on the last one, as is usual in the case of primal suboptimality. It is not until 2017, with the work of Nesterov, that primal-dual convergence rates on the last iterate were obtained. These primal-dual guarantees were recently extended by myself and coauthors to other FW variant like Away-steps FW, Pairwise FW and FW with backtracking line search.

In this post I will only explore primal-dual convergence guaratees. However, duality theory also reveals other connections between FW and other apparently disparate methods. For example, in the paper "Duality between subgradient and conditional gradient methods", Bach showed an equivalence between the FW algorithm and mirror descent on a dual problem.

Problem and notation

As in the first part of these notes, I will be discussing the FW method applied to the following general optimization problem

\begin{equation}\label{eq:fw_objective} \minimize_{\xx \in \mathcal{D}} f(\boldsymbol{x})~, \end{equation}

where $f$ is differentiable with $L$-Lipschitz gradient and the domain $\mathcal{D}$ is a convex and compact set the domain of $g$ is compact a compact subset of $\RR^d$. In this post, and unlike in the first part, we will also assume that $f$ is convex.

From an initial guess $\xx_0$, the FW algorithm generates a sequence of iterates $\xx_1, \xx_2, \ldots$ that converge towards the solution to \eqref{eq:fw_objective}. Below is the pseudo-code for this algorithm, as presented in the first part of these notes:

\begin{align} &\textbf{Input}: \text{initial guess $\xx_0$, tolerance $\delta > 0$}\nonumber\\ & \textbf{For }t=0, 1, \ldots \textbf{ do } \\ &\quad\boldsymbol{s}_t = \argmax_{\boldsymbol{s} \in \mathcal{D}} \langle -\nabla f(\boldsymbol{x}_t), \boldsymbol{s}\rangle\label{eq:lmo}\\ &\quad \boldsymbol{d}_t = \ss_t - \xx_t\\ &\quad g_t = -\langle \nabla f(\xx_t), \dd_t \rangle\\ &\quad \textbf{If } g_t < \delta: \\ &\quad\qquad\hfill\text{// exit if gap is below tolerance }\nonumber\\ &\quad\qquad\textbf{return } \xx_t\\ &\quad {\textbf{Variant 1}}: \text{set step size as} \nonumber\\ &\quad\qquad\gamma_t = \vphantom{\sum_i}\min\Big\{\frac{g_t}{L\|\dd_t\|^2}, 1 \Big\}\label{eq:step_size}\\ &\quad \textbf{Variant 2}: \text{set step size by line search}\nonumber\\ &\quad\qquad\gamma_t = \argmin_{\gamma \in [0, 1]} f(\xx_t + \gamma \boldsymbol{d}_t)\label{eq:line_search}\\ &\quad\boldsymbol{x}_{t+1} = \boldsymbol{x}_t + \gamma_t \boldsymbol{d}_t~.\label{eq:update_rule}\\ &\textbf{end For loop}\\ & \textbf{return } \xx_t \end{align}


Duality and Frank-Wolfe

In convex optimization, every problem can be paired with another, said to be dual to it. On the basis of this, close connections between otherwise disparate properties arise. The FW algorithm has deep and unexpected links with duality, which we will now explore.

The dual problem is a concave maximization problem that we can associate with our original convex minimization problem. Let $\imath_{\mathcal{D}}$ be the indicator function over $\mathcal{D}$, that is, the function that returns $0$ if the argument belongs to $\mathcal{D}$ and $+\infty$ otherwise.It is sometimes more convenient to specify the constraints of our problem through the indicator function. For example, our original optimization problem $\min_{\xx \in \mathcal{D}}f(\xx)$ can be equivalently written as the following unconstrained problem using the indicator function: $\min_{\xx \in \RR^p} f(\xx) + \imath_{\mathcal{D}}(\xx)$. Then by definition of convex conjugateDefinition (convex conjugate). Let $f: \RR^d \to [-\infty, \infty]$ be an extended real-valued function. Its convex conjugate, denoted $f^*$, is the function $f: \RR^d \to [-\infty, \infty]$ defined by $$ f^{{*}}(\yy) \defas \sup_{\xx \in \RR^d} \left \{ \langle \yy , \xx \rangle - f \left( \xx \right) \right\}\,,\qquad \yy \in \RR^d\,. $$ we then have the following sequence of equivalences that starting from the the original (primal) problem: \begin{align} \min_{\xx \in \mathcal{D}} f(\xx) &= \min_{\xx \in \RR^d} f(\xx) + \imath_{\mathcal{D}}(\xx)\\ &= \min_{\xx \in \RR^d} \max_{\uu \in \RR^d}\Big\{\langle \xx, \uu\rangle - f^*(\uu)\Big\} + \imath_{\mathcal{D}}(\xx)\\ &= \max_{\uu \in \RR^d}\min_{\xx \in \RR^d}\Big\{ \imath_{\mathcal{D}}(\xx) + \langle \xx,\uu\rangle\Big\} - f^*(\uu)\\ &= \max_{\uu \in \RR^d}\underbrace{-\imath_{\mathcal{D}}^*(-\uu) - f^*(\uu)}_{\defas \psi(\uu)}~,\label{eq:pd_relationship} \end{align} where in the second identity we have used Sion's minimax theorem to exchange the $\max$ and $\min$. Note also that we can use $\max$ instead of the usual $\sup$ in the definition of Fenchel conjugate because of the smoothness of $f$, which implies strong convexity of $f^*$. We will call \eqref{eq:pd_relationship}, which is a concave maximization problem, the dual problem. By extension, we will refer to $\psi$ as the dual objective or dual function.

The dual objective lower bounds the primal objective and so the difference between primal and dual objectives gives a bound on the suboptimality. In other words, let $\xx^\star$ be any solution to the original primal problem \eqref{eq:fw_objective}. Then by definition of dual objective we have \begin{equation} f(\xx) - f(\xx^\star) \leq f(\xx) - \psi(\uu)~, \end{equation} for any $\xx \in \mathcal{D}$ and any $\uu \in \dom(\psi)$. This is quite exceptional, as gives a meaningful distance to optimum without need to know $\xx^\star$, which is of course unknown. This can then be used for instance as stopping criterion in optimization algorithms. The quantity $f(\xx) - \psi(\uu)$ is often referred to as the primal-dual gap.

However, computing the primal-dual gap involves evaluating the dual objective, which in the general case can be as costly as solving the original problem. What is special in the case of FW is that the primal-dual gap is given as a byproduct of the method. This is quite unique among optimization methods.

The next lemma shows that the primal-dual gap (for a specific choice of primal and dual variables) is exactly equal to the FW gap $g_t$.

Lemma 2. The FW gap $g_t$ is a duality gap at $\xx = \xx_t, \uu = \nabla f(\xx_t)$. More precisely, we have \begin{equation} g_t = f(\xx_t) - \psi(\nabla f(\xx_t))~. \end{equation}

We have the following sequence of identities \begin{align} g_t &= \max_{\ss \in \mathcal{D}}\langle \nabla f(\xx_t), \xx_t - \ss\rangle\\ & = \langle \nabla f(\xx_t), \xx_t\rangle + {\max_{\ss \in \RR^d}}\Big\{ \langle- \nabla f(\xx_t), \ss\rangle - \imath_{\mathcal{D}}(\ss)\Big\}\\ & = \langle \nabla f(\xx_t), \xx_t\rangle + \imath^*_{\mathcal{D}}(-\nabla f(\xx_t))\\ & = f(\xx_t) + \underbrace{\imath^*_{\mathcal{D}}(-\nabla f(\xx_t)) + f^*(\nabla f(\xx_t))}_{-\psi(\nabla f(\xx_t))}~, \end{align} where the first identity uses the definition of $\ss_t$, the second one the definition of convex conjugate and the last one is a consequence of the Fenchel-Young identity (see for example Proposition 16.10 of Bauschke and Combettes).

Primal-dual Analysis

The next theorem gives an $\mathcal{O}(1/t)$ convergence rate for FW on the primal-dual gap and is the main result in this post.A similar but more general theorem which supports non exact oracles and other step sizes is given in Theorem 2 of my recent paper. Note that since the dual gap $f(\xx_t) - \psi(\uu_t)$ is an upper bound on the primal suboptimality $f(\xx_t) - f(\xx^\star)$, this is strictly stronger and implies the convergence rates we got in Theorem 2 in first part of these notes

Theorem 3 . Let $\uu_t$ be defined recursively as $\uu_0 = \nabla f(\xx_0)$, $\uu_{t+1} = (1 - \xi_t)\uu_{t} + \xi_t\nabla f(\xx_t)$, with $\xi_t = 2 / (t + 2)$. Then we have: \begin{equation} f(\xx_t) - \psi(\uu_t) \leq \frac{2L\diam(\mathcal{A})^2}{t + 1} = \mathcal{O}\left(\frac{1}{t}\right)~. \end{equation}

By the key recursive inequality we have the following inequality, valid for all $\xi_t \in [0, 1]$: \begin{align} f(\xx_{t+1}) &\leq f(\xx_t) - \xi_t\langle \nabla f(\xx_t), \dd_t\rangle + \frac{\xi_t^2 L}{2}\|\ss_t - \xx_t\|^2\\ &= (1 - \xi_t)f(\xx_t) + \xi_t \psi(\balpha_t) + \frac{\xi_t^2 L}{2}\|\ss_t - \xx_t\|^2~, \end{align} where the last identity follows by Lemma 2.

We now introduce the auxiliary variable $\sigma_t$. This is defined recursively as $\sigma_0 = \psi(\balpha_0)$, $\sigma_{t+1} = (1 - \xi_t) \sigma_t + \xi_t \psi(\balpha_t)$. Adding $\sigma_{t+1}$ on both sides of the previous equation gives \begin{equation} f(\xx_{t+1}) - \sigma_{t+1} \leq (1 - \xi)\left[f(\xx_t) - \sigma_t\right] + \frac{\xi^2 L}{2}\|\ss_t - \xx_t\|^2~. \end{equation}

Let $\xi_t = 2 / (t+2)$ and $e_t \defas \frac{t(t+1)}{2}(f(\xx_t) - \sigma_t)$. Multiplying the previous equation by $(t+1)(t+2)/2$ then gives \begin{align} e_{t+1} &\leq e_t + \frac{(t+1)}{(t+2)}L\|\ss_t - \xx_t\|^2\\ &\leq e_t + \diam(\mathcal{D})^2~. \end{align} Adding this last inequality from $0$ to $t-1$ gives \begin{equation}\label{eq:sublinear_bound_sigma} e_{t} \leq t\diam(\mathcal{D})^2 \implies f(\xx_t) - \sigma_t \leq \frac{2 L \diam(\mathcal{D})^2}{t+1} \end{equation}

In order to prove the claimed bound we now only need to prove that $- \psi(\uu_t) \leq - \sigma_t$. We will do this by induction. For $t=0$ we have $\psi(\uu_t) = \sigma_t$ by definition and so the bound is trivially verified. Suppose its true for $t$, then for $t+1$ we have \begin{align} - \psi(\uu_{t+1}) &= - \psi((1 - \xi_t)\uu_t + \xi_t \nabla f(\xx_t)) \\ &\leq - (1 - \xi_t)\psi(\uu_t) - \xi_t \psi(\nabla f(\xx_t))\\ &\leq - (1 - \xi_t)\sigma_t - \xi_t \psi(\nabla f(\xx_t))\\ &= - \sigma_{t+1} \end{align} where the first inequality is true by convexity of $-\psi$ and the second one by the induction hypothesis. Using this bound in \eqref{eq:sublinear_bound_sigma} yields the desired bound \begin{equation} f(\xx_{t}) - \psi(\uu_{t}) \leq \frac{2\diam(\mathcal{D})^2}{t+1} \end{equation}


References

A heartfelt thanks to Nicolas Le Roux and Vlad Niculae for reporting typos on this post.