Keep the gradient flowing

Three Operator Splitting

$$ \def\aa{\boldsymbol a} \def\bb{\boldsymbol b} \def\cc{\boldsymbol c} \def\xx{\boldsymbol x} \def\zz{\boldsymbol z} \def\uu{\boldsymbol u} \def\vv{\boldsymbol v} \def\yy{\boldsymbol y} \def\ss{\boldsymbol s} \def\TT{\boldsymbol T} \def\CC{\boldsymbol C} \def\RR{\mathbb R} \DeclareMathOperator*{\argmin}{{arg\,min}} \DeclareMathOperator*{\argmax}{{arg\,max}} \DeclareMathOperator*{\minimize}{{minimize}} \DeclareMathOperator*{\Fix}{\mathbf{Fix}} \DeclareMathOperator{\prox}{\mathbf{prox}} \def\defas{\stackrel{\text{def}}{=}} $$

Splitting methods decompose a complicated problem into a sequence of simpler subproblems. An idea as fundamental as this one has found applications in many areas like linear algebra, integration of ordinary differential equations, or the Monte-Carlo estimation of expectations, to name a few.

Splitting methods have also made their way in mathematical optimization. From solving non-smooth problems to parallel computing, some of the most effective methods are based on splitting. A method that has recently caught my eye is the Davis-Yin three operator splitting, which is the subject of this post.

Outline:
Introduction
Iteration Complexity Analysis
Convergence Theory
Code
Open Questions
References

Introduction

The Davis-Yin three operator splitting is an algorithm that can solve optimization problems composed of a sum of three terms and was proposed by Damek Davis and Wotao Yin. The method can be seen as a slight generalization of previous methods like the Forward-Douglas-Rachford and the Generalized Forward-Backward, a fact that for some was not appropriately acknowledged. Despite this controversy, the method remains in my opinion one of the most exciting recent developments in optimization because of its excellent empirical performance, elegant formulation and its ease of use.

The three operator splitting can solve optimization problems of the form

\begin{equation}\label{eq:opt_objective}\tag{OPT} \minimize_{\boldsymbol{x} \in \RR^d} f(\xx) + g(\xx) + h(\xx) ~, \end{equation}

where we have access to the gradient of $f$ and the proximal operator of $g$ and $h$.The proximal operator is a crucial element of many splitting methods. For a function $\varphi: \RR^d \to \RR$ and step-size $\gamma > 0$ is denoted $\prox_{\gamma \varphi}$ and defined as $$ \prox_{\gamma \varphi}(\xx) \defas \argmin_{\zz \in \RR^p}\Big\{ \varphi(\zz) + \frac{1}{2 \gamma}\|\xx - \zz\|^2\Big\}~. $$ Despite its simplicity, the proximal operator is at the core of many optimization methods. The following monograph provides a gentle introduction to the topic: Parikh, Neal, and Stephen Boyd. "Proximal algorithms." Foundations and Trends in Optimization, 2014. This formulation allows to express a broad range of problems arising in machine learning and signal processing: $f$ can be any smooth loss function such as the logistic or least squares loss, while the two proximal terms can be generalized to an arbitrary number of them and includes many important penalties like the group lasso with overlap, total variation, $\ell_1$ trend filtering, etc. Furthermore, the penalties can be extended-valued, thus allowing an intersection for convex constraints through the use of the indicator function.

In its basic form, the algorithm takes as input an initial gues $\yy_0 \in \RR^d$ and generates a sequence of iterates $\{\yy_1, \yy_2, \ldots\}$ through the formula \begin{align} \zz_t &= \prox_{\gamma h}(\yy_t)~\label{eq:z_update}\\ \xx_t &= \prox_{\gamma g}(2 \zz_t - \yy_t - \gamma \nabla f(\zz_t))\label{eq:x_update}\\ \yy_{t+1} &= \yy_t - \zz_t + \xx_t~.\label{eq:y_update} \end{align}

Each iteration requires to evaluate the proximal operator of $h$ and $g$, as well as the gradient of $f$. It depends on a single step-size parameter $\gamma$, which is a practical advantage with respect to similar methods like Condat-Vu that depend on two step-size parameters.Considering the three operator splitting and the Condat-Vu algorithm in the same bag is not entirely fair, as the latter can minimize a larger class of objectives of the form \begin{equation} f(\xx) + g(\xx) + h(\boldsymbol{K}\xx)\,, \end{equation} where $\boldsymbol{K}$ is an arbitrary linear operator. Compared to \eqref{eq:opt_objective}, this allows for an extra linear operator inside $h$, which is set to the identity in the three operator splitting. A recently proposed variant of the three operator splitting ("A New Primal–Dual Algorithm for Minimizing the Sum of Three Functions with a Linear Operator", by Ming Yang) allows, as the Condat-Vu algorithm, a linear operator in one of the proximal terms.

The three operator splitting is strongly related to other splitting methods. For example, when $h$ is constant, it defaults to the proximal gradient method. This is easy to verify since in this case the first step becomes $\zz_t = \prox_{\gamma h}(\yy_t) = \yy_t$, and replacing in the next lines gives $\yy_{t+1} = \prox_{\gamma g}(\yy_t - \gamma \nabla f(\yy_t))$, which corresponds to an iteration of the proximal-gradient method. Similarly, it is trivial to verify that whenever $f$ is constant then it defaults to the Douglas-Rachford method. A more detailed comparison with related methods can be found in (Raguet 2018).

Iteration complexity analysis

In this section I will give a simplified proof of the iteration complexity of the method for convex functions.

Assumptions. I assume that $f$, $g$ and $h$ and convex, proper and lower semicontinuous functions. In addition, $f$ is differentiable with Lipschitz gradient, where $L$ denotes its Lipschitz constant.In other words, $f$ verifies $$ \|\nabla f(\xx) - \nabla f(\yy)\|\leq L \|\xx - \yy\| $$ for all $\xx, \yy$ in the domain. The class of functions that verifies this property is sometimes called $L$-smooth functions

It is common to analyze optimization algorithms by proving that the objective function decreases at each iteration. However, this turns out to be problematic for the three operator splitting, as the value of the objective function can be infinity at any iterate. This happens for example when both $g$ and $h$ are the indicator function over a set.

Davis and Yin overcame this difficulty by assuming Lipschitz continuity of one of the proximal terms (Davis and Yin 2017, Theorem 3.1 and 3.2). However, I find this not satisfactory because it leaves out important applications of the algorithm like optimization over an intersection of constraints.

A different approach was proposed in our recent paper, Although this approach seems to be new in the context of analyzing the three operator splitting, it is a classical approach for other primal-dual methods. See for instance the analysis of (Chambolle and Pock, 2016) for a different class of splitting methods. and consists in reformulating the original optimization as a saddle point problem. Let $h^*$ denotes the convex conjugate of $h$, then we have the following identities: \begin{align} &\min_{\xx \in \RR^d} f(\xx) + g(\xx) + h(\xx) \\ &\quad = \min_{\xx \in \RR^d} f(\xx) + g(\xx) + \max_{\uu \in \RR^d}\left\{\langle \xx, \uu \rangle - h^*(\uu)\right\}\\ &\quad= \min_{\xx \in \RR^d} \max_{\uu \in \RR^d} \underbrace{f(\xx) + g(\xx) + \langle \xx, \uu \rangle - h^*(\uu)}_{\defas \mathcal{L}(\xx, \uu)}~. \end{align} We have transformed the original minimization into the problem of finding a saddle point of $\mathcal{L}(\xx, \uu) = f(\xx) + g(\xx) + \langle \xx, \uu \rangle - h^*(\uu)$. Formally, a saddle point of $\mathcal{L}$ is a pair $(\xx^\star, \uu^\star)$ such that the following is verified for any $(\xx, \uu)$ in the domain: \begin{equation}\label{eq:saddle_point} \mathcal{L}(\xx^\star\!, \uu) \leq \mathcal{L}(\xx, \uu^\star) ~. \end{equation} From this definition it follows that $\mathcal{L}(\xx^\star\!, \uu) - \mathcal{L}(\xx, \uu^\star)$ is non-positive for all $(\xx, \uu)$ only at a saddle point and so it is a meaningful convergence criterion.

An unexpected side effect of using the saddle point suboptimality is that it simplifies considerably the iteration complexity analysis. Below is a proof of the $\mathcal{O}(1/t)$ convergence rate on the saddle point suboptimality for the step-size $\gamma=\frac{1}{L}$. Later we will see that under some extra assumptions it is also possible to derive convergence rate on the objective suboptimality from this result.

Theorem 1 (Ergodic convergence rate). Let $\yy_t, \xx_t, \zz_t$ denote the sequence iterates produced by the three operator splitting, let $\uu_t$ be defined as $\uu_t \defas \frac{1}{\gamma}(\yy_t - \zz_t)$ and let $\overline{\xx}_t \defas \frac{1}{t+1}\sum_{i=0}^t \xx_i, \overline{\uu}_t \defas \frac{1}{t+1}\sum_{i=0}^t \uu_i$ denote the ergodic (or average) sequence of iterates. Then for $\gamma=\frac{1}{L}$ we have the following suboptimality bound, valid for all $t \geq 0$, all $\xx, \uu$ in the domain of $\mathcal{L}$: \begin{equation} \mathcal{L}(\overline\xx_{t}, \uu) - \mathcal{L}(\xx, \overline\uu_{t}) \leq \frac{L\|\yy_0 - \xx - \gamma \uu\|^2}{2(t+1)}~. \end{equation}

The proof technique is the same as in our recent paper, and consists very roughly on interpreting the three operator splitting as two consecutive applications of the proximal-gradient algorithm and then use known properties of this method.

The proof crucially relies on the following technical lemma that relates the output of a proximal-gradient step with its input and an arbitrary point in the domain:

Lemma 1 (three point inequality). Let $\psi$ be convex and $L$-smooth and let $\eta$ be convex. If $\aa^+ = \prox_{\sigma \eta}(\aa - \sigma \nabla \psi(\aa))$, then for all $\bb$ in the domain of $\eta$ we have \begin{equation} \begin{aligned} &\psi(\aa^+) + \eta(\aa^+) - \psi(\bb) - \eta(\bb) \\ &\qquad \leq \frac{1}{\gamma}\langle \aa - \aa^+, \aa^+ - \bb \rangle + \frac{L}{2}\|\aa^+ - \aa\|^2 \end{aligned}\label{eq:three_point_inequality} \end{equation}

By $L$-smoothless and convexity respectively we have the inequalities \begin{aligned} \psi(\aa^+) &\leq \psi(\aa) + \langle \nabla \psi(\aa), \aa^+ - \aa\rangle + \frac{L}{2}\|\aa^+ - \aa\|^2\\ \psi(\aa) &\leq \psi(\bb) + \langle \nabla \psi(\aa), \aa - \bb\rangle~. \end{aligned} Adding them together gives the inequality \begin{equation} \psi(\aa^+) \leq \psi(\bb) + \langle \nabla \psi(\aa), \aa^+ - \bb\rangle + \frac{L}{2}\|\aa^+ - \aa\|^2 \end{equation}

By the definition of proximal operator $\aa^+$ is characterized by the first order optimality conditions \begin{equation} 0 \in \frac{1}{\gamma}(\aa^+ - \aa) + \nabla\psi(\aa) + \partial \eta(\aa^+) \end{equation} from where $\frac{1}{\gamma}(\aa - \aa^+) - \nabla\psi(\aa)\in \partial \eta(\aa^+)$ and so by convexity we have \begin{equation} \eta(\aa^+) - \eta(\bb) \leq \langle \frac{1}{\gamma}(\aa - \aa^+) - \nabla \psi(\aa), \aa^+ - \bb \rangle \end{equation}

Finally, adding both we obtain the claimed bound $$ \psi(\aa^+) - \psi(\bb) + \eta(\aa^+) - \eta(\bb) \leq \frac{1}{\gamma}\langle \aa - \aa^+, \aa^+ - \bb \rangle + \frac{L}{2}\|\aa^+ - \aa\|^2~. $$


We will start the proof of this theorem by rewriting the algorithm in an equivalent but more convenient form. Using Moreau's decomposition for the proximal operatorThe Moreau decomposition relates the proximal operator of a function $\varphi$ and that of its convex conjugate $\varphi^*$ as $$\prox_{\gamma \varphi}(\yy)= \yy - \gamma \prox_{\varphi^*/\gamma}(\yy/\gamma)\,.$$ and the definition of $\uu_t$ the original formulation of the algorithm in Eq. \eqref{eq:z_update}-\eqref{eq:y_update} is equivalent to: \begin{align} \uu_t &= \prox_{h^*/\gamma}(\uu_{t-1} +\xx_{t-1}/\gamma)\\ \zz_t &= \xx_{t-1} + \gamma (\uu_{t-1} - \uu_t)\label{eq:pd_z_update}\\ \xx_t &= \prox_{\gamma g}(\zz_t - \gamma (\nabla f(\zz_t) + \uu_t))\\ \end{align}

The next step is to use twice Lemma 1 to bound the saddle point suboptimality: \begin{align} \mathcal{L}(&\xx_t, \uu) - \mathcal{L}(\xx, \uu_t)\\ & = \overbrace{f(\xx_t) + \langle \xx_t - \xx, \uu_t\rangle - f(\xx)}^{\psi(\xx_t) - \psi(\xx)} + \overbrace{g(\xx_t) - g(\xx)}^{\eta(\xx_t) - \eta(\xx)} - \langle \xx_t, \uu_t - \uu\rangle - h^*(\uu) + h^*(\uu_t)\\ &\leq \frac{1}{\gamma}\langle \zz_t - \xx_t, \xx_t - \xx\rangle + \frac{L}{2}\|\xx_t - \zz_t\|^2 + \langle \xx_{t-1} - \xx_t, \uu_t - \uu\rangle\nonumber\\ &\qquad\qquad - \overbrace{\langle \xx_{t-1}, \uu_t - \uu\rangle}^{\psi(\uu_t) - \psi(\uu)} + \overbrace{ h^*(\uu_t) - h^*(\uu)}^{\eta(\uu_t) - \eta(\uu)}\\ &\leq \frac{1}{\gamma}\langle \zz_t - \xx_t, \xx_t - \xx\rangle + \frac{L}{2}\|\xx_t - \zz_t\|^2 + \langle \gamma(\uu_{t-1} - \uu_t) + \xx_{t-1} - \xx_{t}, \uu_t - \uu\rangle \\ &\stackrel{\eqref{eq:pd_z_update}}{=} \frac{1}{\gamma}\langle \zz_t - \xx_t, \xx_t + \gamma \uu_t - \xx - \gamma \uu\rangle + \frac{L}{2}\|\xx_t - \zz_t\|^2\\ &\stackrel{\eqref{eq:y_update}}{=} \frac{1}{\gamma}\langle \yy_{t} - \yy_{t+1}, \yy_{t+1} - \yy\rangle + \frac{L}{2}\|\yy_{t+1} - \yy_t\|^2~, \text{ with $\yy \defas \xx + \gamma \uu$}\\ &= \frac{1}{2\gamma}\|\yy_{t} - \yy\|^2 - \frac{1}{2\gamma}\|\yy_{t+1} - \yy\|^2 +\left( \frac{1}{2\gamma} - \frac{L}{2}\right) \|\yy_{t+1} - \yy_t\|^2~, & \end{align} where in the first inequality we have used Lemma 1 with $\psi(\cdot) = f(\cdot) + \langle \cdot, \uu_t\rangle, \eta = g, \sigma = \gamma, \aa=\zz_t$ and in the second inequality we have used the same lemma but this time with $\psi(\cdot) = \langle \xx_{t-1}, \cdot\rangle$, $\eta = h^*$, $\sigma=1/\gamma, \aa=\uu_{t-1}$.

Setting $\gamma=\frac{1}{L}$ makes the last term vanish and adding this last inequality from $0$ to $t$ we obtain \begin{align} \sum_{i=0}^t \mathcal{L}(\xx_i, \uu) - \mathcal{L}(\xx, \uu_i) &\leq \frac{L}{2}\|\yy_0 - \yy\|^2 - \frac{L}{2}\|\yy_{t+1} - \yy\|^2\\ & \leq \frac{L}{2}\|\yy_0 - \yy\|^2\label{eq:sum_suboptimality} \end{align}

Finally, $\mathcal{L}(\xx_i, \uu)$ is a convex function of $\xx_i$ and $- \mathcal{L}(\xx, \uu_i)$ is also a convex function of $\uu_i$. Applying Jensens inequality to this sum gives \begin{align} \sum_{i=0}^t \mathcal{L}(\xx_i, \uu) - \mathcal{L}(\xx, \uu_i) &= (t+1)\frac{1}{t+1}\sum_{i=0}^t \mathcal{L}(\xx_i, \uu) - \mathcal{L}(\xx, \uu_i)\\ &\geq (t+1)\left(\mathcal{L}(\overline{\xx}_i, \uu) - \mathcal{L}(\xx, \overline{\uu}_i)\right)~. \end{align} Combining this with \eqref{eq:sum_suboptimality} and dividing both sides by $(t+1)$ we obtain \begin{equation} \mathcal{L}(\overline{\xx}_t, \uu) - \mathcal{L}(\xx, \overline{\uu}_t) \leq \frac{L\|\yy_0 - \yy\|^2}{2(t+1)}~, \end{equation} which is the desired bound.

A couple of remarks on this theorem. First, the bound is given in terms of saddle point suboptimality. This is a bound that holds for all $\xx, \uu$, which might seem strange as convergence bounds in convex optimization typically only involve the current and optimal iterate. As highlighted in (Gidel 2017), a suboptimality criterion that only involves the current and optimal iterate like $\mathcal{L}(\xx_t, \uu^\star) - \mathcal{L}(\xx^\star, \uu_t)$, where $(\xx^\star, \uu^\star)$ is a saddle point, is not meaningful for saddle point convex problems as it can be zero for a point that is not a saddle point. A bound that holds for all $(\xx, \uu)$ is crucial for a meaningful analysis and these bounds are not uncommon in the analysis of primal-dual methods.

Second, by definition of convex conjugate, maximizing $\mathcal{L}$ over $\uu$ we recover the primal objective: $\sup_\uu \mathcal{L}(\xx, \uu)$ $= f(\xx) + g(\xx) + \sup_\uu \left\{\langle \xx, \uu\rangle - h^*(\uu)\right\}$ $= f(\xx) + g(\xx) + h(\xx)$. Since the bound in the previous theorem is valid for all $\uu$ in the domain, we can maximize over this variable to obtain a bound in terms of the primal objective suboptimality. Unfortunately, the value of $\uu$ that maximizes the saddle point objective for a fixed $\xx$ is not necessarily bounded, rendering the bound meaningless. However, assuming $\beta_h$-Lipschitz continuity of $h$ and using known results in duality theory that related Lipschitz continuity to bounded domain we can obtain the following result:

Corollary 1. Let $P(\xx) \defas f(\xx) + g(\xx) + h(\xx)$ denote the objective function and $\xx^\star$ any solution to \eqref{eq:opt_objective}. If $h$ is $\beta_h$-Lipschitz then for the step-size $\gamma=\frac{1}{L}$ we have \begin{equation} P(\overline\xx_t) - P(\xx^\star) \leq \frac{L\|\yy_0 - \xx\|^2 + \beta_h^2/L}{t+1} \end{equation}

Let $\widehat\uu_t \defas \argmin_{\uu} \mathcal{L}(\overline\xx_{t}, \uu)$ and $(\xx^\star, \uu^\star)$ be a saddle point of $\mathcal{L}$. Then $\mathcal{L}(\overline\xx_{t}, \widehat\uu_t) = P(\overline\xx_{t})$ and $\mathcal{L}(\xx^\star, \uu^\star) = P(\xx^\star)$ by definition of convex conjugate. Using this and the previous theorem we can write the following set of inequalities \begin{align} &P(\overline\xx_{t}) - P(\xx^\star) = \mathcal{L}(\overline\xx_{t}, \widehat\uu_t) - \mathcal{L}(\xx^\star, \uu^\star)\\ &\leq \mathcal{L}(\overline\xx_{t}, \widehat\uu_t) - \mathcal{L}(\xx^\star, \widehat\uu_t)\\ &\quad \text{ (definition of saddle point, Eq. \eqref{eq:saddle_point} with $\xx=\xx^\star$) }\nonumber\\ &\leq \frac{L}{2 (t+1)} \|\yy_0 - \xx^\star - \gamma \widehat\uu_t\|^2\label{eq:primal_subopt_proof_bound}\\ &\quad\text{ (Theorem 1 with $\xx = \xx^\star, \uu = \widehat\uu_t$)}\nonumber \end{align} The $\beta_h$-Lipschitz assumption on $h$ implies that the norm of every element in the domain of $h^*$ is bounded by $\beta_h$, see e.g. Corollary 13.3.3 in Rockafellar 1970. This way we bound \begin{align} \|\yy_0 - \xx^\star - \gamma\widehat\uu_t\|^2 &\leq 2 \|\yy_0 - \xx^\star\|^2 + 2 \gamma^2\| \widehat\uu_t\|^2\\ &\leq 2 \|\yy_0 - \xx^\star\|^2 + 2 \gamma^2 \beta_h^2~. \end{align} Plugging this bound into Eq. \eqref{eq:primal_subopt_proof_bound} and using $\gamma=\frac{1}{L}$ we obtain the claimed bound \begin{equation} P(\overline\xx_{t}) - P(\xx^\star) \leq \frac{L\|\yy_0 - \xx^\star\|^2 + \beta_h^2/L}{t+1}~. \end{equation}

It is possible to derive stronger convergence rates under stronger assumptions, like linear convergence under strong convexity of the smooth term and smoothness of one of the proximal terms. Proofs of this result can be found in (Davis and Yin, 2015) and (Pedregosa and Gidel, 2018).


Code

I maintain a (very much work-in-progress) implementation of the algorithm in the C-OPT package. Here is the function reference and here and here are a couple of examples.

Open questions

As of August 2018, there are still many open questions surrounding the three operator splitting, here is a short list biased towards my own interests. If you come from the future and some have been answered I would love to hear about it in the comments.


References