Three Operator Splitting
Splitting methods decompose a complicated problem into a sequence of simpler subproblems. An idea as fundamental as this one has found applications in many areas like linear algebra,
Splitting methods have also made their way in mathematical optimization. From solving non-smooth problems
Outline:
Introduction
Iteration Complexity Analysis
Convergence Theory
Code
Open Questions
References
Introduction
The Davis-Yin three operator splitting is an algorithm that can solve optimization problems composed of a sum of three terms and was proposed by Damek Davis and Wotao Yin. The method can be seen as a slight generalization of previous methods like the Forward-Douglas-Rachford
The three operator splitting can solve optimization problems of the form
\begin{equation}\label{eq:opt_objective}\tag{OPT} \minimize_{\boldsymbol{x} \in \RR^d} f(\xx) + g(\xx) + h(\xx) ~, \end{equation}
where we have access to the gradient of $f$ and the proximal operator of $g$ and $h$.
In its basic form, the algorithm takes as input an initial gues $\yy_0 \in \RR^d$ and generates a sequence of iterates $\{\yy_1, \yy_2, \ldots\}$ through the formula \begin{align} \zz_t &= \prox_{\gamma h}(\yy_t)~\label{eq:z_update}\\ \xx_t &= \prox_{\gamma g}(2 \zz_t - \yy_t - \gamma \nabla f(\zz_t))\label{eq:x_update}\\ \yy_{t+1} &= \yy_t - \zz_t + \xx_t~.\label{eq:y_update} \end{align}
Each iteration requires to evaluate the proximal operator of $h$ and $g$, as well as the gradient of $f$. It depends on a single step-size parameter $\gamma$, which is a practical advantage with respect to similar methods like Condat-Vu
The three operator splitting is strongly related to other splitting methods.
For example, when $h$ is constant, it defaults to the proximal gradient method. This is easy to verify since in this case the first step becomes $\zz_t = \prox_{\gamma h}(\yy_t) = \yy_t$, and replacing in the next lines gives $\yy_{t+1} = \prox_{\gamma g}(\yy_t - \gamma \nabla f(\yy_t))$, which corresponds to an iteration of the proximal-gradient method. Similarly, it is trivial to verify that whenever $f$ is constant then it defaults to the Douglas-Rachford method. A more detailed comparison with related methods can be found in (Raguet 2018).
Iteration complexity analysis
In this section I will give a simplified proof of the iteration complexity of the method for convex functions.
Assumptions. I assume that $f$, $g$ and $h$ and convex, proper and lower semicontinuous functions. In addition, $f$ is differentiable with Lipschitz gradient, where $L$ denotes its Lipschitz constant.
It is common to analyze optimization algorithms by proving that the objective function decreases at each iteration. However, this turns out to be problematic for the three operator splitting, as the value of the objective function can be infinity at any iterate. This happens for example when both $g$ and $h$ are the indicator function over a set.
Davis and Yin overcame this difficulty by assuming Lipschitz continuity of one of the proximal terms (Davis and Yin 2017, Theorem 3.1 and 3.2).
A different approach was proposed in our recent paper,
An unexpected side effect of using the saddle point suboptimality is that it simplifies considerably the iteration complexity analysis. Below is a proof of the $\mathcal{O}(1/t)$ convergence rate on the saddle point suboptimality for the step-size $\gamma=\frac{1}{L}$. Later we will see that under some extra assumptions it is also possible to derive convergence rate on the objective suboptimality from this result.
Theorem 1 (Ergodic convergence rate). Let $\yy_t, \xx_t, \zz_t$ denote the sequence iterates produced by the three operator splitting, let $\uu_t$ be defined as $\uu_t \defas \frac{1}{\gamma}(\yy_t - \zz_t)$ and let $\overline{\xx}_t \defas \frac{1}{t+1}\sum_{i=0}^t \xx_i, \overline{\uu}_t \defas \frac{1}{t+1}\sum_{i=0}^t \uu_i$ denote the ergodic (or average) sequence of iterates. Then for $\gamma=\frac{1}{L}$ we have the following suboptimality bound, valid for all $t \geq 0$, all $\xx, \uu$ in the domain of $\mathcal{L}$: \begin{equation} \mathcal{L}(\overline\xx_{t}, \uu) - \mathcal{L}(\xx, \overline\uu_{t}) \leq \frac{L\|\yy_0 - \xx - \gamma \uu\|^2}{2(t+1)}~. \end{equation}
The proof technique is the same as in our recent paper,
The proof crucially relies on the following technical lemma that relates the output of a proximal-gradient step with its input and an arbitrary point in the domain:
Lemma 1 (three point inequality). Let $\psi$ be convex and $L$-smooth and let $\eta$ be convex. If $\aa^+ = \prox_{\sigma \eta}(\aa - \sigma \nabla \psi(\aa))$, then for all $\bb$ in the domain of $\eta$ we have \begin{equation} \begin{aligned} &\psi(\aa^+) + \eta(\aa^+) - \psi(\bb) - \eta(\bb) \\ &\qquad \leq \frac{1}{\gamma}\langle \aa - \aa^+, \aa^+ - \bb \rangle + \frac{L}{2}\|\aa^+ - \aa\|^2 \end{aligned}\label{eq:three_point_inequality} \end{equation}
By $L$-smoothless and convexity respectively we have the inequalities \begin{aligned} \psi(\aa^+) &\leq \psi(\aa) + \langle \nabla \psi(\aa), \aa^+ - \aa\rangle + \frac{L}{2}\|\aa^+ - \aa\|^2\\ \psi(\aa) &\leq \psi(\bb) + \langle \nabla \psi(\aa), \aa - \bb\rangle~. \end{aligned} Adding them together gives the inequality \begin{equation} \psi(\aa^+) \leq \psi(\bb) + \langle \nabla \psi(\aa), \aa^+ - \bb\rangle + \frac{L}{2}\|\aa^+ - \aa\|^2 \end{equation}
By the definition of proximal operator $\aa^+$ is characterized by the first order optimality conditions \begin{equation} 0 \in \frac{1}{\gamma}(\aa^+ - \aa) + \nabla\psi(\aa) + \partial \eta(\aa^+) \end{equation} from where $\frac{1}{\gamma}(\aa - \aa^+) - \nabla\psi(\aa)\in \partial \eta(\aa^+)$ and so by convexity we have \begin{equation} \eta(\aa^+) - \eta(\bb) \leq \langle \frac{1}{\gamma}(\aa - \aa^+) - \nabla \psi(\aa), \aa^+ - \bb \rangle \end{equation}
Finally, adding both we obtain the claimed bound $$ \psi(\aa^+) - \psi(\bb) + \eta(\aa^+) - \eta(\bb) \leq \frac{1}{\gamma}\langle \aa - \aa^+, \aa^+ - \bb \rangle + \frac{L}{2}\|\aa^+ - \aa\|^2~. $$
We will start the proof of this theorem by rewriting the algorithm in an equivalent but more convenient form. Using Moreau's decomposition for the proximal operator
The next step is to use twice Lemma 1 to bound the saddle point suboptimality: \begin{align} \mathcal{L}(&\xx_t, \uu) - \mathcal{L}(\xx, \uu_t)\\ & = \overbrace{f(\xx_t) + \langle \xx_t - \xx, \uu_t\rangle - f(\xx)}^{\psi(\xx_t) - \psi(\xx)} + \overbrace{g(\xx_t) - g(\xx)}^{\eta(\xx_t) - \eta(\xx)} - \langle \xx_t, \uu_t - \uu\rangle - h^*(\uu) + h^*(\uu_t)\\ &\leq \frac{1}{\gamma}\langle \zz_t - \xx_t, \xx_t - \xx\rangle + \frac{L}{2}\|\xx_t - \zz_t\|^2 + \langle \xx_{t-1} - \xx_t, \uu_t - \uu\rangle\nonumber\\ &\qquad\qquad - \overbrace{\langle \xx_{t-1}, \uu_t - \uu\rangle}^{\psi(\uu_t) - \psi(\uu)} + \overbrace{ h^*(\uu_t) - h^*(\uu)}^{\eta(\uu_t) - \eta(\uu)}\\ &\leq \frac{1}{\gamma}\langle \zz_t - \xx_t, \xx_t - \xx\rangle + \frac{L}{2}\|\xx_t - \zz_t\|^2 + \langle \gamma(\uu_{t-1} - \uu_t) + \xx_{t-1} - \xx_{t}, \uu_t - \uu\rangle \\ &\stackrel{\eqref{eq:pd_z_update}}{=} \frac{1}{\gamma}\langle \zz_t - \xx_t, \xx_t + \gamma \uu_t - \xx - \gamma \uu\rangle + \frac{L}{2}\|\xx_t - \zz_t\|^2\\ &\stackrel{\eqref{eq:y_update}}{=} \frac{1}{\gamma}\langle \yy_{t} - \yy_{t+1}, \yy_{t+1} - \yy\rangle + \frac{L}{2}\|\yy_{t+1} - \yy_t\|^2~, \text{ with $\yy \defas \xx + \gamma \uu$}\\ &= \frac{1}{2\gamma}\|\yy_{t} - \yy\|^2 - \frac{1}{2\gamma}\|\yy_{t+1} - \yy\|^2 +\left( \frac{1}{2\gamma} - \frac{L}{2}\right) \|\yy_{t+1} - \yy_t\|^2~, & \end{align} where in the first inequality we have used Lemma 1 with $\psi(\cdot) = f(\cdot) + \langle \cdot, \uu_t\rangle, \eta = g, \sigma = \gamma, \aa=\zz_t$ and in the second inequality we have used the same lemma but this time with $\psi(\cdot) = \langle \xx_{t-1}, \cdot\rangle$, $\eta = h^*$, $\sigma=1/\gamma, \aa=\uu_{t-1}$.
Setting $\gamma=\frac{1}{L}$ makes the last term vanish and adding this last inequality from $0$ to $t$ we obtain \begin{align} \sum_{i=0}^t \mathcal{L}(\xx_i, \uu) - \mathcal{L}(\xx, \uu_i) &\leq \frac{L}{2}\|\yy_0 - \yy\|^2 - \frac{L}{2}\|\yy_{t+1} - \yy\|^2\\ & \leq \frac{L}{2}\|\yy_0 - \yy\|^2\label{eq:sum_suboptimality} \end{align}
Finally, $\mathcal{L}(\xx_i, \uu)$ is a convex function of $\xx_i$ and $- \mathcal{L}(\xx, \uu_i)$ is also a convex function of $\uu_i$. Applying Jensens inequality to this sum gives \begin{align} \sum_{i=0}^t \mathcal{L}(\xx_i, \uu) - \mathcal{L}(\xx, \uu_i) &= (t+1)\frac{1}{t+1}\sum_{i=0}^t \mathcal{L}(\xx_i, \uu) - \mathcal{L}(\xx, \uu_i)\\ &\geq (t+1)\left(\mathcal{L}(\overline{\xx}_i, \uu) - \mathcal{L}(\xx, \overline{\uu}_i)\right)~. \end{align} Combining this with \eqref{eq:sum_suboptimality} and dividing both sides by $(t+1)$ we obtain \begin{equation} \mathcal{L}(\overline{\xx}_t, \uu) - \mathcal{L}(\xx, \overline{\uu}_t) \leq \frac{L\|\yy_0 - \yy\|^2}{2(t+1)}~, \end{equation} which is the desired bound.
A couple of remarks on this theorem. First, the bound is given in terms of saddle point suboptimality. This is a bound that holds for all $\xx, \uu$, which might seem strange as convergence bounds in convex optimization typically only involve the current and optimal iterate. As highlighted in (Gidel 2017),
Second, by definition of convex conjugate, maximizing $\mathcal{L}$ over $\uu$ we recover the primal objective: $\sup_\uu \mathcal{L}(\xx, \uu)$ $= f(\xx) + g(\xx) + \sup_\uu \left\{\langle \xx, \uu\rangle - h^*(\uu)\right\}$ $= f(\xx) + g(\xx) + h(\xx)$. Since the bound in the previous theorem is valid for all $\uu$ in the domain, we can maximize over this variable to obtain a bound in terms of the primal objective suboptimality. Unfortunately, the value of $\uu$ that maximizes the saddle point objective for a fixed $\xx$ is not necessarily bounded, rendering the bound meaningless. However, assuming $\beta_h$-Lipschitz continuity of $h$ and using known results in duality theory that related Lipschitz continuity to bounded domain we can obtain the following result:
Corollary 1. Let $P(\xx) \defas f(\xx) + g(\xx) + h(\xx)$ denote the objective function and $\xx^\star$ any solution to \eqref{eq:opt_objective}. If $h$ is $\beta_h$-Lipschitz then for the step-size $\gamma=\frac{1}{L}$ we have \begin{equation} P(\overline\xx_t) - P(\xx^\star) \leq \frac{L\|\yy_0 - \xx\|^2 + \beta_h^2/L}{t+1} \end{equation}
Let $\widehat\uu_t \defas \argmin_{\uu} \mathcal{L}(\overline\xx_{t}, \uu)$ and $(\xx^\star, \uu^\star)$ be a saddle point of $\mathcal{L}$. Then $\mathcal{L}(\overline\xx_{t}, \widehat\uu_t) = P(\overline\xx_{t})$ and $\mathcal{L}(\xx^\star, \uu^\star) = P(\xx^\star)$ by definition of convex conjugate.
Using this and the previous theorem we can write the following set of inequalities
\begin{align}
&P(\overline\xx_{t}) - P(\xx^\star) = \mathcal{L}(\overline\xx_{t}, \widehat\uu_t) - \mathcal{L}(\xx^\star, \uu^\star)\\
&\leq \mathcal{L}(\overline\xx_{t}, \widehat\uu_t) - \mathcal{L}(\xx^\star, \widehat\uu_t)\\
&\quad \text{ (definition of saddle point, Eq. \eqref{eq:saddle_point} with $\xx=\xx^\star$) }\nonumber\\
&\leq \frac{L}{2 (t+1)} \|\yy_0 - \xx^\star - \gamma \widehat\uu_t\|^2\label{eq:primal_subopt_proof_bound}\\
&\quad\text{ (Theorem 1 with $\xx = \xx^\star, \uu = \widehat\uu_t$)}\nonumber
\end{align}
The $\beta_h$-Lipschitz assumption on $h$ implies that the norm of every element in the domain of $h^*$ is bounded by $\beta_h$, see e.g. Corollary 13.3.3 in Rockafellar 1970.
It is possible to derive stronger convergence rates under stronger assumptions, like linear convergence under strong convexity of the smooth term and smoothness of one of the proximal terms. Proofs of this result can be found in (Davis and Yin, 2015)
Code
I maintain a (very much work-in-progress) implementation of the algorithm in the C-OPT package. Here is the function reference and here and here are a couple of examples.
Open questions
As of August 2018, there are still many open questions surrounding the three operator splitting, here is a short list biased towards my own interests. If you come from the future and some have been answered I would love to hear about it in the comments.
- Non-convex objectives. Given that proximal gradient descent converges to a stationary point even when the smooth term is non-convex, it seems reasonable to think that the three operator splitting exhibits a similar behavior.
- Linear convergence under weaker assumptions. In practice I have always observed an empirical linear convergence of the method on objectives with a strongly convex smooth term and non-smooth proximal terms, yet the strongest known results to date require smoothness of one of the proximal terms to obtain a linear convergence rate. Update December 2018: Ryu and coauthors
have given a negative answer to this conjecture. The authors have developed a framework for obtaining tight convergence rates of a large class of splitting methods that include the three operator splitting. Their results show that is impossibility to obtain a linear convergence rate under assumptions of convexity of $g$ and $h$ and smoothness and strong convexity of $f$. - Practical acceleration. Davis and Yin
propose an accelerated variant of the method which requires to know the strong convexity parameter of the objective. However, this parameter is typically unknown, and so the question arises of whether it is possible to design an accelerated variant which (as FISTA) does not require to know the strong convexity parameter.