Keep the gradient flowing (http://fa.bianp.net)
Optimization Nuggets: Stochastic Polyak Step-size, Part 2 · 2023-11-19 · Fabian Pedregosa and <a href='https://fabian-sp.github.io/'>Fabian Schaipp</a> · tag:fa.bianp.net,2023-11-19:/blog/2023/sps2/
<p>
This blog post discusses the convergence rate of the Stochastic Gradient Descent with Stochastic Polyak Step-size (SGD-SPS) algorithm for minimizing a finite sum objective. Building upon the proof of the previous post, we show that the convergence rate can be improved to O(1/t) under the additional assumption that the objective function is strongly convex.
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script>
MathJax = {
tex: {
inlineMath: [['$', '$'], ['\\(', '\\)']],
tags: 'ams'
},
svg: {
fontCache: 'global'
}
};
</script>
<script type="text/javascript" id="MathJax-script" async src="/node_modules/mathjax3/es5/tex-svg.js">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@inproceedings{gunasekar2018characterizing,
title={Characterizing implicit bias in terms of optimization geometry},
author={Gunasekar, Suriya and Lee, Jason and Soudry, Daniel and Srebro, Nathan},
booktitle={International Conference on Machine Learning},
pages={1832--1841},
year={2018},
organization={PMLR},
url={https://arxiv.org/pdf/1802.08246.pdf}
}
@inproceedings{loizou2021stochastic,
title={Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence},
author={Loizou, Nicolas and Vaswani, Sharan and Laradji, Issam Hadj and Lacoste-Julien, Simon},
booktitle={International Conference on Artificial Intelligence and Statistics},
pages={1306--1314},
year={2021},
url={https://arxiv.org/pdf/2002.10542.pdf},
organization={PMLR}
}
@article{orvieto2022dynamics,
title={Dynamics of SGD with stochastic polyak stepsizes: Truly adaptive variants and convergence to exact solution},
author={Orvieto, Antonio and Lacoste-Julien, Simon and Loizou, Nicolas},
journal={Advances in Neural Information Processing Systems},
url={https://arxiv.org/pdf/2205.04583.pdf},
year={2022}
}
@inproceedings{berrada2020training,
title={Training neural networks for and by interpolation},
author={Berrada, Leonard and Zisserman, Andrew and Kumar, M Pawan},
booktitle={International conference on machine learning},
pages={799--809},
url={https://arxiv.org/pdf/1906.05661.pdf},
year={2020},
organization={PMLR}
}
@article{polyak1987introduction,
title={Introduction to optimization},
author={Polyak, Boris},
year={1987},
journal={Optimization Software},
url={https://www.researchgate.net/profile/Boris-Polyak-2/publication/342978480_Introduction_to_Optimization/links/5f1033e5299bf1e548ba4636/Introduction-to-Optimization.pdf}
}
@article{boyd2003subgradient,
title={Subgradient methods},
author={Boyd, Stephen and Xiao, Lin and Mutapcic, Almir},
journal={lecture notes of EE392o, Stanford University, Autumn Quarter},
volume={2004},
pages={2004--2005},
year={2003}
}
@article{strohmer2009randomized,
title={A randomized Kaczmarz algorithm with exponential convergence},
author={Strohmer, Thomas and Vershynin, Roman},
journal={Journal of Fourier Analysis and Applications},
volume={15},
number={2},
pages={262--278},
year={2009},
publisher={Springer},
url={https://link.springer.com/content/pdf/10.1007/s00041-008-9030-4.pdf}
}
@article{polyak1969minimization,
title={Minimization of unsmooth functionals},
author={Polyak, Boris Teodorovich},
journal={USSR Computational Mathematics and Mathematical Physics},
year={1969},
publisher={Elsevier},
url={https://www.researchgate.net/profile/Boris-Polyak-2/publication/223841567_The_Method_of_Projections_for_Finding_the_Common_Point_of_Convex_Sets/links/59e98201aca272bc42a181a0/The-Method-of-Projections-for-Finding-the-Common-Point-of-Convex-Sets.pdf}
}
@book{shor1985minimization,
title={Minimization methods for non-differentiable functions},
author={Shor, Naum Zuselevich},
volume={3},
year={1985},
publisher={Springer Science \& Business Media},
url={https://doi.org/10.1007/978-3-642-82118-9}
}
@article{brannlund1995generalized,
title={A generalized subgradient method with relaxation step},
author={Brännlund, Ulf},
journal={Mathematical Programming},
volume={71},
number={2},
pages={207--219},
year={1995},
publisher={Springer},
url={https://link.springer.com/content/pdf/10.1007/BF01585999.pdf}
}
@article{nedic2001convergence,
title={Convergence rate of incremental subgradient algorithms},
author={Nedich, Angelia and Bertsekas, Dimitri},
journal={Stochastic optimization: algorithms and applications},
pages={223--264},
year={2001},
url={https://doi.org/10.1007/978-1-4757-6594-6_11},
publisher={Springer}
}
@article{hazan2019revisiting,
title={Revisiting the Polyak step size},
author={Hazan, Elad and Kakade, Sham},
journal={arXiv preprint arXiv:1905.00313},
url={https://arxiv.org/abs/1905.00313},
year={2019}
}
@article{karczmarz1937angenaherte,
title={Angenäherte Auflösung von Systemen linearer Gleichungen},
author={Karczmarz, Stefan},
journal={Bull. Int. Acad. Pol. Sci. Lett., Cl. Sci. Math. Nat.},
pages={355--357},
url={https://faculty.sites.iastate.edu/esweber/files/inline-files/kaczmarz_english_translation_1937.pdf},
year={1937}
}
@article{schmidt2013fast,
title={Fast convergence of stochastic gradient descent under a strong growth condition},
author={Schmidt, Mark and Roux, Nicolas Le},
journal={arXiv preprint arXiv:1308.6370},
year={2013},
url={https://arxiv.org/pdf/1308.6370.pdf}
}
@inproceedings{ma2018power,
title={The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning},
author={Ma, Siyuan and Bassily, Raef and Belkin, Mikhail},
booktitle={International Conference on Machine Learning},
pages={3325--3334},
year={2018},
organization={PMLR},
url={https://proceedings.mlr.press/v80/ma18a.html}
}
@article{vaswani2019painless,
title={Painless stochastic gradient: Interpolation, line-search, and convergence rates},
author={Vaswani, Sharan and Mishkin, Aaron and Laradji, Issam and Schmidt, Mark and Gidel, Gauthier and Lacoste-Julien, Simon},
journal={Advances in neural information processing systems},
volume={32},
year={2019},
url={https://arxiv.org/abs/1905.09997}
}
@article{crammer2006online,
title={Online passive-aggressive algorithms},
author={Crammer, Koby and Dekel, Ofer and Keshet, Joseph and Shalev-Shwartz, Shai and Singer, Yoram},
journal={Journal of Machine Learning Research},
volume={7},
year={2006},
url={https://www.jmlr.org/papers/volume7/crammer06a/crammer06a.pdf}
}
@article{gower2022cutting,
title={Cutting some slack for SGD with adaptive Polyak stepsizes},
author={Gower, Robert M and Blondel, Mathieu and Gazagnadou, Nidham and Pedregosa, Fabian},
journal={arXiv preprint arXiv:2202.12328},
year={2022},
url={https://arxiv.org/abs/2202.12328}
}
@article{garrigos2023function,
title={Function Value Learning: Adaptive Learning Rates Based on the Polyak Stepsize and Function Splitting in ERM},
author={Garrigos, Guillaume and Gower, Robert M and Schaipp, Fabian},
journal={arXiv preprint arXiv:2307.14528},
url={https://arxiv.org/abs/2307.14528},
year={2023}
}
@article{nesterov2006cubic,
title={Cubic regularization of Newton method and its global performance},
author={Nesterov, Yurii and Polyak, Boris T},
journal={Mathematical Programming},
volume={108},
number={1},
pages={177--205},
year={2006},
publisher={Springer},
url={https://doi.org/10.1007/s10107-006-0706-8}
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
document.addEventListener('DOMContentLoaded', doTOC2, false);
</script>
<p id="TOC" class="framed"></p>
$$
\require{mathtools}
\require{color}
\def\aa{\boldsymbol a}
\def\rr{\boldsymbol r}
\def\AA{\boldsymbol A}
\def\HH{\boldsymbol H}
\def\EE{\mathbb E}
\def\II{\boldsymbol I}
\def\CC{\boldsymbol C}
\def\DD{\boldsymbol D}
\def\KK{\boldsymbol K}
\def\eeps{\boldsymbol \varepsilon}
\def\tr{\text{tr}}
\def\LLambda{\boldsymbol \Lambda}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\qq{\boldsymbol q}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\lmax{L}
\def\lmin{\ell}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\QQ{\boldsymbol Q}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\DeclareMathOperator{\span}{\mathbf{span}}
\def\defas{\stackrel{\text{def}}{=}}
\def\dif{\mathop{}\!\mathrm{d}}
\definecolor{colormomentum}{RGB}{27, 158, 119}
\def\cvx{{\color{colormomentum}\mu}}
\definecolor{color1}{RGB}{127,201,127}
\definecolor{color2}{RGB}{179,226,205}
\definecolor{color3}{RGB}{253,205,172}
\definecolor{color4}{RGB}{203,213,232}
\definecolor{colorstepsize}{RGB}{215,48,39}
\def\stepsize{{\color{color3}{\boldsymbol{\gamma}}}}
\def\harmonic{{\color{colorstep2}\boldsymbol{h}}}
\def\cvx{{\color{colorcvx}\boldsymbol{\mu}}}
\def\smooth{{\color{colorsmooth}\boldsymbol{L}}}
\def\noise{{\color{colornoise}\boldsymbol{\sigma}}}
$$
</div>
<h2>Faster Convergence under Strong Convexity</h2>
<p>
Shortly after I published <a href="/blog/2023/sps/">my last post</a> on the convergence of the stochastic Polyak step-size, my name buddy <a href="https://fabian-sp.github.io/">Fabian Schaipp</a> pointed out that the convergence rate can be improved to $\mathcal{O}(1/t)$ under strong convexity of the objective.
</p>
<p>
<blockquote class="twitter-tweet" data-theme="dark"><p lang="en" dir="ltr">small update: using the same techniques + the proof technique from <a href="https://twitter.com/HazanPrinceton?ref_src=twsrc%5Etfw">@HazanPrinceton</a> 's paper, you can also show 1/t convergence for strongly convex, nonsmooth. You need to assume bounded gradients *only for a bounded set*. <a href="https://t.co/fDvSmkHjd9">pic.twitter.com/fDvSmkHjd9</a></p>— Fabian Schaipp (@FSchaipp) <a href="https://twitter.com/FSchaipp/status/1712813907314679841?ref_src=twsrc%5Etfw">October 13, 2023</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</p>
<p>
This faster rate uses the same technique that Elad Hazan and Sham Kakade<dt-cite key="hazan2019revisiting"></dt-cite> used to analyze the deterministic Polyak step-size. The proof is rather short and elegant, so I decided to write another blog post about it with <a href="https://fabian-sp.github.io/">Fabian Schaipp</a>.
</p>
<h2>Stochastic Gradient Descent with Stochastic Polyak Step-size</h2>
<p>
We'll consider the same setting as in the <a href="/blog/2023/sps/">previous post</a>. We aim to minimize a finite sum objective $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, where each $f_i$ is star-convex and $f$ is $\mu$-strongly convex. The SGD with stochastic Polyak step-size (SGD-SPS) algorithm<dt-note>The variant described here was proposed by <a href="https://arxiv.org/pdf/2307.14528.pdf">(Garrigos et al. 2023)</a> under the name $\text{SPS}_+$. </dt-note> <dt-cite key="garrigos2023function"></dt-cite> is defined by the following recursion:
\begin{equation}\label{eq:sps}
\begin{aligned}
& \text{sample uniformly at random } i \in \{1, \ldots, n\}\\
& \stepsize_t = \frac{1}{\|\nabla f_i(x_t)\|^2}(f_i(x_t) - f_i(x_\star))_+ \\
& x_{t+1} = x_t - \stepsize_t \nabla f_i(x_t)
\end{aligned}
\end{equation}
where $x_\star = \arg \min f(x)$ is the solution to the problem (in this case the solution is unique because of strong convexity).
</p>
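<p>
A minimal NumPy sketch of one such update (not from the original post; <code>grad_fi</code>, <code>fi_x</code> and <code>fi_star</code> stand for $\nabla f_i(x_t)$, $f_i(x_t)$ and $f_i(x_\star)$ of the sampled index) might look as follows:
</p>

```python
import numpy as np

def sps_step(x, grad_fi, fi_x, fi_star):
    """One SGD-SPS update for the sampled index i (the SPS_+ variant).

    grad_fi is (any) subgradient of f_i at x; fi_x = f_i(x); fi_star = f_i(x_star).
    """
    sq_norm = grad_fi @ grad_fi
    if sq_norm == 0.0:
        # pseudoinverse convention: take a zero step when the subgradient vanishes
        return x
    step = max(fi_x - fi_star, 0.0) / sq_norm  # (f_i(x) - f_i(x_star))_+ / ||grad||^2
    return x - step * grad_fi
```

<p>
For instance, for a least-squares term $f_i(x) = (a_i^\top x - b_i)^2$ with $f_i(x_\star) = 0$, the resulting step-size is $1/(4\|a_i\|^2)$.
</p>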
<h2>$\mathcal{O}(1/t)$ rate under strong convexity</h2>
<p>
While the <a href="/blog/2023/sps/">previous post</a> showed an $\mathcal{O}(1/\sqrt{t})$ convergence rate under star-convexity, this post shows that the rate can be improved to $\mathcal{O}(1/t)$ under the additional assumption that $f$ (but not necessarily the $f_i$'s) is strongly convex.
</p>
<p>
Without further ado, here is the main result of this post.
</p>
<p class="theorem framed">
Assume that $f$ is $\mu$-strongly convex and $f_i$ is star-convex around the minimizer $x_\star$ for all $i$. Furthermore, we'll also assume that the subgradients are bounded in the ball $\mathcal{B}$ with center $x_\star$ and radius $\|x_0 - x_\star\|$, that is, we have $\|\nabla f_i(x)\|\leq G$ for all $i$ and $x \in \mathcal{B}$. Then, SGD-SPS converges in expected error at a rate of at least $O(1/(T+1))$. That is, after $T$ steps we have
\begin{equation}
\EE \|x_T - x_\star\|^2 \leq \frac{4 G^2}{\mu^2 (T+1)} \,.
\end{equation}
</p>
<div class="proof">
<p>
From the main recursive inequality proven in Eq. (9) of the <a href="/blog/2023/sps/">previous post</a>, we have that
\begin{equation}\label{eq:key_inequality}
\EE(f(x_t) - f(x_\star))^2 \leq G^2(\EE\|x_t - x_\star\|^2 - \EE\|x_{t+1} - x_\star\|^2) \,.
\end{equation}
Strong convexity of $f$ implies that $f(x_t) - f(x_\star) \geq \frac{\mu}{2}\|x_t - x_\star\|^2$. Plugging this into the previous inequality and grouping terms we have
\begin{equation}
\EE\|x_{t+1} - x_\star\|^2\leq \EE\|x_t - x_\star\|^2\left(1 - \frac{\mu^2}{4 G^2}\EE\|x_t - x_\star\|^2\right) \,.
\end{equation}
Let $a_t \defas \frac{\mu^2}{4 G^2}\EE\|x_t - x_\star\|^2$. Multiplying both sides of the previous equation by $\frac{\mu^2}{4 G^2}$ we have
\begin{equation}
a_{t+1} \leq a_t (1 - a_t) \,.
\end{equation}
We'll now prove by induction that the inequality above implies $a_t \leq \frac{1}{t+1}$ for all $t$. For the base case $t=0$ we have by <a href="/blog/2017/optimization-inequalities-cheatsheet/#sec4">strong convexity</a> and the boundedness of the gradients that
\begin{equation}
a_0 = \frac{\mu^2}{4 G^2}\EE\|x_0 - x_\star\|^2 \leq \frac{1}{4 G^2}\|\nabla f(x_0)\|^2 \leq \frac{1}{4} \,.
\end{equation}
For $t=1$, we have $a_1 \leq a_0(1 - a_0)$ with $a_0 \leq 1$. As $x \mapsto x (1 - x)$ has a maximal value of $\frac{1}{4}$ over the interval $[0, 1]$, we have $a_1 \leq \frac{1}{4} \leq \frac{1}{2}$. For the induction step, assume that $a_{t-1} \leq \frac{1}{t}$ for some $t \geq 2$. Since $x \mapsto x(1-x)$ is increasing on $[0, \frac{1}{2}]$ and $\frac{1}{t} \leq \frac{1}{2}$, we have
\begin{align}
a_t &\leq a_{t-1}(1 - a_{t-1}) \leq \max_{x \in [0, 1/t]} x (1 - x) \\
&= \frac{1}{t}(1 - \frac{1}{t}) = \frac{1}{t+1} \frac{t^2 - 1}{t^2} \leq \frac{1}{t+1} \,.
\end{align}
We have hence proven that $a_t \leq \frac{1}{t+1}$ for all $t$. Plugging this back into the definition of $a_t$ and multiplying both sides by $\frac{4 G^2}{\mu^2}$ yields the desired result.
</p>
</div>
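<p>
The induction can also be checked numerically (a quick sketch, not part of the original argument). Since $x \mapsto x(1-x)$ is increasing on $[0, \tfrac{1}{2}]$, the equality case $a_{t+1} = a_t(1 - a_t)$ started from the worst admissible value $a_0 = \tfrac{1}{4}$ dominates every sequence satisfying the recursion, so it suffices to test that one:
</p>

```python
def worst_case_sequence(T, a0=0.25):
    """Iterate a_{t+1} = a_t * (1 - a_t), the equality case of the recursion
    in the proof, starting from the worst-case value a_0 = 1/4."""
    seq = [a0]
    for _ in range(T):
        a = seq[-1]
        seq.append(a * (1 - a))
    return seq

# the proof claims a_t <= 1/(t+1) for all t
seq = worst_case_sequence(10_000)
assert all(a <= 1 / (t + 1) for t, a in enumerate(seq))
```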
<p>
It's important to note that in the previous result the bounded subgradients assumption is restricted to the <i>bounded</i> set $\mathcal{B}$.
We could make this assumption instead of the more common bounded-on-the-whole-domain assumption thanks to the error monotonicity proven in Eq. (5) of the <a href="/blog/2023/sps/">previous post</a>.
Without that result, we would have to assume bounded subgradients on the whole space, an assumption that is incompatible with strong convexity (in other words, the set of functions with both bounded subgradients on the whole space and strong convexity is empty).
</p>
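<p>
As a quick numerical sanity check of the theorem (not part of the original post), the sketch below runs SGD-SPS on a consistent least-squares problem, where $f_i(x_\star) = 0$. The strong convexity constant $\mu$ is computed exactly from the Hessian of $f$, while $G$ is replaced by the largest gradient norm observed along the trajectory, an empirical proxy for the bound on the ball $\mathcal{B}$:
</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 20, 5, 2000

# consistent least-squares problem: f_i(x) = (a_i^T x - b_i)^2 with f_i(x_star) = 0
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star

x = rng.standard_normal(d)
errors = [float(np.sum((x - x_star) ** 2))]
G = 0.0  # largest gradient norm seen: empirical proxy for the bound on the ball
for _ in range(T):
    i = rng.integers(n)
    r = A[i] @ x - b[i]
    g = 2.0 * r * A[i]  # gradient of the sampled f_i
    G = max(G, float(np.linalg.norm(g)))
    sq = g @ g
    if sq > 0:  # pseudoinverse convention: zero step on a zero gradient
        x = x - (max(r * r, 0.0) / sq) * g  # SPS step, using f_i(x_star) = 0
    errors.append(float(np.sum((x - x_star) ** 2)))

mu = 2.0 * float(np.linalg.eigvalsh(A.T @ A / n).min())  # strong convexity constant of f
assert errors[-1] <= 4.0 * G**2 / (mu**2 * (T + 1))  # the theorem's bound
assert all(e1 <= e0 + 1e-12 for e0, e1 in zip(errors, errors[1:]))  # monotone error
```

<p>
On this problem the iterates in fact contract linearly (SPS on a consistent linear system is a Kaczmarz-type method with relaxation $\tfrac{1}{2}$), so the $\mathcal{O}(1/t)$ bound holds with a large margin.
</p>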
<h2>Citing</h2>
<p>
If you find this blog post useful, please consider citing as
</p>
<blockquote>
<a href="http://fa.bianp.net/blog/2023/sps2/">Stochastic Polyak Step-size, Faster rates under strong convexity</a>, Fabian Pedregosa and Fabian Schaipp, 2023
</blockquote>
<p>
with bibtex entry:
</p>
<pre>
<code style="width: 100%">
@misc{pedregosa2023sps2,
title={Stochastic Polyak Step-size, Faster rates under strong convexity},
author={Pedregosa, Fabian and Schaipp, Fabian},
howpublished = {\url{http://fa.bianp.net/blog/2023/sps2/}},
year={2023}
}
</code>
</pre>
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
Optimization Nuggets: Stochastic Polyak Step-size · 2023-09-29 · Fabian Pedregosa · tag:fa.bianp.net,2023-09-29:/blog/2023/sps/
<p>
The stochastic Polyak step-size (SPS) is a practical variant of the Polyak step-size for stochastic optimization. In this blog post, we'll discuss the algorithm and provide a simple analysis for convex objectives with bounded gradients.
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script>
MathJax = {
tex: {
inlineMath: [['$', '$'], ['\\(', '\\)']],
tags: 'ams'
},
svg: {
fontCache: 'global'
}
};
</script>
<script type="text/javascript" id="MathJax-script" async src="/node_modules/mathjax3/es5/tex-svg.js">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@inproceedings{gunasekar2018characterizing,
title={Characterizing implicit bias in terms of optimization geometry},
author={Gunasekar, Suriya and Lee, Jason and Soudry, Daniel and Srebro, Nathan},
booktitle={International Conference on Machine Learning},
pages={1832--1841},
year={2018},
organization={PMLR},
url={https://arxiv.org/pdf/1802.08246.pdf}
}
@inproceedings{loizou2021stochastic,
title={Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence},
author={Loizou, Nicolas and Vaswani, Sharan and Laradji, Issam Hadj and Lacoste-Julien, Simon},
booktitle={International Conference on Artificial Intelligence and Statistics},
pages={1306--1314},
year={2021},
url={https://arxiv.org/pdf/2002.10542.pdf},
organization={PMLR}
}
@article{orvieto2022dynamics,
title={Dynamics of SGD with stochastic polyak stepsizes: Truly adaptive variants and convergence to exact solution},
author={Orvieto, Antonio and Lacoste-Julien, Simon and Loizou, Nicolas},
journal={Advances in Neural Information Processing Systems},
url={https://arxiv.org/pdf/2205.04583.pdf},
year={2022}
}
@inproceedings{berrada2020training,
title={Training neural networks for and by interpolation},
author={Berrada, Leonard and Zisserman, Andrew and Kumar, M Pawan},
booktitle={International conference on machine learning},
pages={799--809},
url={https://arxiv.org/pdf/1906.05661.pdf},
year={2020},
organization={PMLR}
}
@article{polyak1987introduction,
title={Introduction to optimization},
author={Polyak, Boris},
year={1987},
journal={Optimization Software},
url={https://www.researchgate.net/profile/Boris-Polyak-2/publication/342978480_Introduction_to_Optimization/links/5f1033e5299bf1e548ba4636/Introduction-to-Optimization.pdf}
}
@article{boyd2003subgradient,
title={Subgradient methods},
author={Boyd, Stephen and Xiao, Lin and Mutapcic, Almir},
journal={lecture notes of EE392o, Stanford University, Autumn Quarter},
volume={2004},
pages={2004--2005},
year={2003}
}
@article{strohmer2009randomized,
title={A randomized Kaczmarz algorithm with exponential convergence},
author={Strohmer, Thomas and Vershynin, Roman},
journal={Journal of Fourier Analysis and Applications},
volume={15},
number={2},
pages={262--278},
year={2009},
publisher={Springer},
url={https://link.springer.com/content/pdf/10.1007/s00041-008-9030-4.pdf}
}
@article{polyak1969minimization,
title={Minimization of unsmooth functionals},
author={Polyak, Boris Teodorovich},
journal={USSR Computational Mathematics and Mathematical Physics},
year={1969},
publisher={Elsevier},
url={https://www.researchgate.net/profile/Boris-Polyak-2/publication/223841567_The_Method_of_Projections_for_Finding_the_Common_Point_of_Convex_Sets/links/59e98201aca272bc42a181a0/The-Method-of-Projections-for-Finding-the-Common-Point-of-Convex-Sets.pdf}
}
@book{shor1985minimization,
title={Minimization methods for non-differentiable functions},
author={Shor, Naum Zuselevich},
volume={3},
year={1985},
publisher={Springer Science \& Business Media},
url={https://doi.org/10.1007/978-3-642-82118-9}
}
@article{brannlund1995generalized,
title={A generalized subgradient method with relaxation step},
author={Brännlund, Ulf},
journal={Mathematical Programming},
volume={71},
number={2},
pages={207--219},
year={1995},
publisher={Springer},
url={https://link.springer.com/content/pdf/10.1007/BF01585999.pdf}
}
@article{nedic2001convergence,
title={Convergence rate of incremental subgradient algorithms},
author={Nedich, Angelia and Bertsekas, Dimitri},
journal={Stochastic optimization: algorithms and applications},
pages={223--264},
year={2001},
url={https://doi.org/10.1007/978-1-4757-6594-6_11},
publisher={Springer}
}
@article{hazan2019revisiting,
title={Revisiting the Polyak step size},
author={Hazan, Elad and Kakade, Sham},
journal={arXiv preprint arXiv:1905.00313},
url={https://arxiv.org/abs/1905.00313},
year={2019}
}
@article{karczmarz1937angenaherte,
title={Angenäherte Auflösung von Systemen linearer Gleichungen},
author={Karczmarz, Stefan},
journal={Bull. Int. Acad. Pol. Sci. Lett., Cl. Sci. Math. Nat.},
pages={355--357},
url={https://faculty.sites.iastate.edu/esweber/files/inline-files/kaczmarz_english_translation_1937.pdf},
year={1937}
}
@article{schmidt2013fast,
title={Fast convergence of stochastic gradient descent under a strong growth condition},
author={Schmidt, Mark and Roux, Nicolas Le},
journal={arXiv preprint arXiv:1308.6370},
year={2013},
url={https://arxiv.org/pdf/1308.6370.pdf}
}
@inproceedings{ma2018power,
title={The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning},
author={Ma, Siyuan and Bassily, Raef and Belkin, Mikhail},
booktitle={International Conference on Machine Learning},
pages={3325--3334},
year={2018},
organization={PMLR},
url={https://proceedings.mlr.press/v80/ma18a.html}
}
@article{vaswani2019painless,
title={Painless stochastic gradient: Interpolation, line-search, and convergence rates},
author={Vaswani, Sharan and Mishkin, Aaron and Laradji, Issam and Schmidt, Mark and Gidel, Gauthier and Lacoste-Julien, Simon},
journal={Advances in neural information processing systems},
volume={32},
year={2019},
url={https://arxiv.org/abs/1905.09997}
}
@article{crammer2006online,
title={Online passive-aggressive algorithms},
author={Crammer, Koby and Dekel, Ofer and Keshet, Joseph and Shalev-Shwartz, Shai and Singer, Yoram},
journal={Journal of Machine Learning Research},
volume={7},
year={2006},
url={https://www.jmlr.org/papers/volume7/crammer06a/crammer06a.pdf}
}
@article{gower2022cutting,
title={Cutting some slack for SGD with adaptive Polyak stepsizes},
author={Gower, Robert M and Blondel, Mathieu and Gazagnadou, Nidham and Pedregosa, Fabian},
journal={arXiv preprint arXiv:2202.12328},
year={2022},
url={https://arxiv.org/abs/2202.12328}
}
@article{garrigos2023function,
title={Function Value Learning: Adaptive Learning Rates Based on the Polyak Stepsize and Function Splitting in ERM},
author={Garrigos, Guillaume and Gower, Robert M and Schaipp, Fabian},
journal={arXiv preprint arXiv:2307.14528},
url={https://arxiv.org/abs/2307.14528},
year={2023}
}
@article{nesterov2006cubic,
title={Cubic regularization of Newton method and its global performance},
author={Nesterov, Yurii and Polyak, Boris T},
journal={Mathematical Programming},
volume={108},
number={1},
pages={177--205},
year={2006},
publisher={Springer},
url={https://doi.org/10.1007/s10107-006-0706-8}
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
document.addEventListener('DOMContentLoaded', doTOC2, false);
</script>
<p id="TOC" class="framed"></p>
$$
\require{mathtools}
\require{color}
\def\aa{\boldsymbol a}
\def\rr{\boldsymbol r}
\def\AA{\boldsymbol A}
\def\HH{\boldsymbol H}
\def\EE{\mathbb E}
\def\II{\boldsymbol I}
\def\CC{\boldsymbol C}
\def\DD{\boldsymbol D}
\def\KK{\boldsymbol K}
\def\eeps{\boldsymbol \varepsilon}
\def\tr{\text{tr}}
\def\LLambda{\boldsymbol \Lambda}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\qq{\boldsymbol q}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\lmax{L}
\def\lmin{\ell}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\QQ{\boldsymbol Q}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\DeclareMathOperator{\span}{\mathbf{span}}
\def\defas{\stackrel{\text{def}}{=}}
\def\dif{\mathop{}\!\mathrm{d}}
\definecolor{colormomentum}{RGB}{27, 158, 119}
\def\cvx{{\color{colormomentum}\mu}}
\definecolor{colorstepsize}{RGB}{217,95,2}
\def\stepsize{{\color{colorstepsize}{\boldsymbol{\gamma}}}}
\def\harmonic{{\color{colorstep2}\boldsymbol{h}}}
\def\cvx{{\color{colorcvx}\boldsymbol{\mu}}}
\def\smooth{{\color{colorsmooth}\boldsymbol{L}}}
\def\noise{{\color{colornoise}\boldsymbol{\sigma}}}
\definecolor{color1}{RGB}{127,201,127}
\definecolor{color2}{RGB}{179,226,205}
\definecolor{color3}{RGB}{253,205,172}
\definecolor{color4}{RGB}{203,213,232}
$$
</div>
<h2>SGD with Polyak Step-size</h2>
<p>
A recent breakthrough in optimization is the realization that the venerable Polyak step-size,<dt-cite key="polyak1969minimization"></dt-cite> <dt-cite key="hazan2019revisiting"></dt-cite> originally developed for deterministic/full-gradient optimization, extends naturally to stochastic optimization.<dt-cite key="berrada2020training"></dt-cite> <dt-cite key="loizou2021stochastic"></dt-cite> This step-size is particularly interesting because it does not require hyperparameter tuning.
</p>
<p>
Let $f$ be an average of $n$ functions $f_1, \ldots, f_n$. We consider the problem of finding a minimizer of $f$:
\begin{equation}\label{eq:opt}
x_{\star} \in \argmin_{x \in \RR^p} \left\{ f(x) \defas \frac{1}{n} \sum_{i=1}^n f_i(x) \right\}\,,
\end{equation}
with access to subgradients of $f_i$. In this post we won't assume smoothness of the $f_i$'s, so subgradients are not necessarily unique. To avoid the notational clutter of dealing with sets of subgradients, we adopt the convention that $\nabla f_i(x)$ denotes <i>any</i> subgradient of $f_i$ at $x$.
</p>
<p>
By now there have been a few formulations of this method, with subtle differences. The variant that I present here corresponds to what <a href="https://arxiv.org/pdf/2307.14528.pdf">Garrigos et al. 2023</a> call the $\text{SPS}_+$ method.
Unlike the earlier variants of <a href="https://arxiv.org/pdf/1906.05661.pdf">Berrada et al. 2020</a> and <a href="https://arxiv.org/pdf/2002.10542.pdf">Loizou et al. 2021</a>, this variant doesn't impose a maximum step-size.
</p>
<p class="framed">
<b class="tufte-underline">SGD with Stochastic Polyak Step-size (SGD-SPS)</b><br>
<b>Input</b>: starting guess $x_0 \in \RR^d$. <br>
<b>For</b> $t=0, 1, \ldots$ <br>
\begin{equation}\label{eq:sps}
\begin{aligned}
& \text{sample uniformly at random } i \in \{1, \ldots, n\}\\
& \stepsize_t = \frac{1}{\|\nabla f_i(x_t)\|^2}(f_i(x_t) - f_i(x_\star))_+ \\
&x_{t+1} = x_t - \stepsize_t \nabla f_i(x_t)
\end{aligned}
\end{equation}
</p>
<p>
In the algorithm above, $(z)_+ \defas \max\{z, 0\}$ denotes the positive part function, and $\frac{1}{\|\nabla f_i(x_t)\|^2}$ should be understood as the pseudoinverse of $\|\nabla f_i(x_t)\|^2$ so that the step-size is zero whenever $\|\nabla f_i(x_t)\| = 0$.
</p>
<p>
This step-size $\stepsize_t$ is a marvel of simplicity and efficiency. The denominator $\|\nabla f_i(x_t)\|^2$ is the squared norm of the partial gradient, which we need to compute anyway for the update direction. The numerator depends on the partial function value $f_i(x_t)$, which most automatic differentiation frameworks provide for free as a by-product of computing the gradient. All in all, the overhead of the stochastic Polyak step-size is negligible compared to the cost of computing the partial gradient.
</p>
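<p>
To make this concrete, here is a small sketch (a hypothetical helper, not from the post) for a least-squares term: the value and the gradient share one residual computation, mimicking how autodiff frameworks return the loss value together with the gradient:
</p>

```python
import numpy as np

def value_and_grad_lsq(a, b, x):
    """Value and gradient of f_i(x) = (a^T x - b)^2 from one shared residual."""
    r = a @ x - b   # the residual is computed once...
    return r * r, 2 * r * a  # ...and yields both f_i(x) and its gradient

val, grad = value_and_grad_lsq(np.array([1.0, 2.0]), 1.0, np.array([1.0, 1.0]))
# residual r = 1 + 2 - 1 = 2, so val = 4 and grad = [4, 8]
```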
<p>
Not everything is rainbows and unicorns though.
The Polyak step-size has one big drawback: it requires knowledge of the partial function values at the minimizer, $f_i(x_\star)$.<dt-note>Other variants, such as the one from <a href="https://arxiv.org/pdf/2002.10542.pdf">(Loizou et al., 2021)</a>, require instead knowledge of $\inf_z f_i(z)$, which is potentially easier to estimate than $\inf_x f(x)$.</dt-note> This is a problem because the minimizer $x_\star$ is unknown. However, in some deep learning applications, the model has enough capacity to achieve $f_i(x_\star) = 0$ for all $i$. In this case, SPS can be used with $f_i(x_\star) = 0$. Examples of such objectives are consistent least-squares problems, SVMs with separable data, or a neural network with a quadratic loss and enough capacity to fit the data.
</p>
<h2>Convergence under Star Convexity and Locally Bounded Gradients</h2>
<p>
We'll now prove that the SGD-SPS algorithm enjoys an $\mathcal{O}(1/\sqrt{t})$ convergence rate for star-convex objectives with bounded gradients (optimal for this class of functions!).<dt-note>Faster convergence rates are possible under further assumptions. For example, <a href="https://arxiv.org/pdf/2002.10542.pdf">(Loizou et al., 2021)</a> show an $\mathcal{O}(1/t)$ rate for smooth objectives and exponential convergence for strongly convex objectives. However, contrary to the result in this blog post, these convergence rates hold only <i>up to a neighborhood</i> whose size depends on $f(x_\star) - \EE[\inf_x f_i(x)]$.</dt-note> This result was recently developed by Garrigos et al.<dt-cite key="garrigos2023function"></dt-cite> and the proof below is a minor relaxation of theirs.
</p>
<p>
The two assumptions that we will make on the objective are star-convexity<dt-cite key="nesterov2006cubic"></dt-cite> <dt-note>This condition is sometimes also known as <i>one-point convexity</i>.</dt-note> and locally bounded gradients. These assumptions are satisfied by many objectives of interest, including least-squares and logistic regression.<dt-note>We're implicitly assuming in \eqref{eq:opt} that $f$ admits at least one minimizer. This leaves out some objectives such as over-parametrized and un-regularized logistic regression, where the infimum is never achieved.
</dt-note>
</p>
<p class="definition" text="star-convex">
A function $f_i$ is star-convex around $x_\star$<dt-note>Note that here $x_\star$ is the minimizer of $f$ but not necessarily that of $f_i$. This is a slight generalization with respect to the definition of star convexity in the original works of Nesterov and Polyak, where the notion of star-convexity is with respect to its minimizer.</dt-note> if the following inequality is verified for all $x$ in the domain:
\begin{equation}\label{eq:star-convexity}
f_i(x) - f_i(x_\star)\leq \langle \nabla f_i(x), x - x_\star\rangle \,.
\end{equation}
</p>
<p>
Convex functions are star-convex, since they verify \eqref{eq:star-convexity} for all pairs of points $x, x_\star$, while in the definition above $x_\star$ is fixed. The converse is not true: star-convex functions can have non-convex level sets (for example, star-shaped ones).<dt-note><img src="/images/2023/star-convex.png" alt=""> <br>
Heatmap of a function that is star-convex but not convex. Credits: <a href="https://www.cs.purdue.edu/homes/pvaliant/starconvex.pdf">(Lee and Valiant, 2016)</a>.</dt-note>
</p>
<p>
The other assumption that we'll make is that the gradients $\nabla f_i$ are locally bounded, which, as noted above, holds for many objectives of interest. With this, here is the main result of this post.
</p>
<p class="theorem framed">
Assume that for all $i$, $f_i$ is star-convex around a minimizer $x_\star$. Furthermore, we'll also assume that the subgradients are bounded in the ball $\mathcal{B}$ with center $x_\star$ and radius $\|x_0 - x_\star\|$, that is, we have $\|\nabla f_i(x)\|\leq G$ for all $i$ and $x \in \mathcal{B}$. Then, SGD-SPS converges in expected function error at a rate of at least $O(1/\sqrt{T})$. That is, after $T$ steps we have
\begin{equation}
\min_{t=0, \ldots, T-1} \,\EE f(x_t) - f(x_\star) \leq \frac{G}{\sqrt{T}}\|x_0 - x_\star\| \,.
\end{equation}
</p>
<div class="proof">
<p>
The proof mostly follows that of Garrigos et al. 2023,<dt-cite key="garrigos2023function"></dt-cite> with the difference that the global Lipschitzness assumption is relaxed to locally bounded subgradients.<dt-note>This is important, as some losses, such as the squared loss, have locally bounded gradients but are not globally Lipschitz.</dt-note>
</p>
<p>
I've structured the proof into three parts. 1️⃣ First, we'll establish that the iterates are bounded; this allows us to avoid a global Lipschitz assumption. 2️⃣ Then we'll derive a key inequality that relates the error at two consecutive iterations. 3️⃣ Finally, we'll sum this inequality over iterations; most terms will cancel out, and we'll obtain the desired convergence rate.
</p>
<p>
1️⃣ <b>Error is monotonically decreasing.</b> We'd like to show that the iterate error $\|x_t - x_\star\|^2$ is monotonically decreasing in $t$. Let's consider first the case $\|\nabla f_i(x_t)\| \neq 0$, where $i$ denotes the random index selected at iteration $t$.
Then using the definition of $x_{t+1}$ and expanding the square we have
\begin{align}
&\|x_{t+1} - x_\star\|^2 = \|x_t - x_\star\|^2 - 2 \stepsize_t \langle \nabla f_i(x_t), x_t - x_\star\rangle + \stepsize_t^2 \|\nabla f_i(x_t)\|^2 \nonumber\\
&\quad\stackrel{\text{(star convexity)}}{\leq} \|x_t - x_\star\|^2 - 2 \stepsize_t (f_i(x_t) - f_i(x_\star)) + \stepsize_t^2 \|\nabla f_i(x_t)\|^2 \nonumber\\
&\quad\stackrel{\text{(definition of $\stepsize_t$)}}{=} \|x_t - x_\star\|^2 - \frac{(f_i(x_t) - f_i(x_\star))_+^2}{\|\nabla f_i(x_t)\|^2} \label{eq:last_decreasing}
\end{align}
This last equation shows that the error is monotonically decreasing whenever $\|\nabla f_i(x_t)\| \neq 0$.
</p>
<p>
Let's now consider the case $\|\nabla f_i(x_t)\| = 0$. In this case, the step-size is 0 by our definition of $\stepsize_t$, and so the error is constant. We have established that the error is monotonically decreasing in both cases.
</p>
<p>
2️⃣ <b>Key recursive inequality.</b> Let's again first consider the case $\|\nabla f_i(x_t)\| \neq 0$. Since we've established that the error is monotonically decreasing, the iterates remain in the ball centered at $x_\star$ with radius $\|x_0 - x_\star\|$. We can then use the locally bounded gradients assumption on \eqref{eq:last_decreasing} to obtain
\begin{equation}\label{eq:recursive_G}
\|x_{t+1} - x_\star\|^2 \leq \|x_t - x_\star\|^2 - \frac{1}{G^2}(f_i(x_t) - f_i(x_\star))_+^2 \,.
\end{equation}
</p>
<p>
Let's now consider the case in which $\|\nabla f_i(x_t)\| = 0$. In this case, because of star-convexity, we have $f_i(x_t) - f_i(x_\star) \leq 0$ and so $(f_i(x_t) - f_i(x_\star))_+ = 0$. Hence inequality \eqref{eq:recursive_G} is also trivially verified. We have established that \eqref{eq:recursive_G} holds in all cases.
</p>
<p>
Let $\EE_t$ denote the expectation conditioned on all randomness up to iteration $t$. Then
taking expectations on both sides of the previous inequality and using Jensen's inequality on the convex function $z \mapsto z^2_{+}$ we have
\begin{align}
\EE_t\|x_{t+1} - x_\star\|^2 &\leq \|x_t - x_\star\|^2 - \frac{1}{G^2} \EE_t(f_i(x_t) - f_i(x_\star))_+^2 \\
&\stackrel{\mathclap{\text{(Jensen's)}}}{\leq} \|x_t - x_\star\|^2 - \frac{1}{G^2} (f(x_t) - f(x_\star))^2 \,,
\end{align}
where we have dropped the positive part in the last term, which is non-negative by definition of $x_\star$.
Finally, taking full expectations on both sides, using the tower property of expectation and rearranging we have our key recursive inequality:
\begin{equation}\label{eq:key_inequality}
\boxed{\vphantom{\sum_a^b} \EE(f(x_t) - f(x_\star))^2 \leq G^2(\EE\|x_t - x_\star\|^2 - \EE\|x_{t+1} - x_\star\|^2)}
\end{equation}
</p>
<p>
3️⃣ <b>Telescoping and final rate.</b> Using Jensen's inequality once more, we have
\begin{align}
\min_{t=0, \ldots, T-1}\,(\EE f(x_t) - f(x_\star))^2~&\stackrel{\mathclap{\text{(Jensen's)}}}{\leq}~\min_{t=0, \ldots, T-1} \EE(f(x_t) - f(x_\star))^2\\
&\leq \frac{1}{T}\sum_{t=0}^{T-1} \EE(f(x_t) - f(x_\star))^2 \\
&\stackrel{\eqref{eq:key_inequality}}{\leq} \frac{G^2}{T} \|x_0 - x_\star\|^2
\end{align}
Taking square roots on both sides gives
\begin{equation}
\sqrt{\min_{t=0, \ldots, T-1}\,(\EE f(x_t) - f(x_\star))^2} \leq \frac{G}{\sqrt{T}} \|x_0 - x_\star\|\,.
\end{equation}
Finally, by the monotonicity of the square root, we can bring the square root inside the $\min$ to obtain the desired result.
</p>
</div>
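<p>
To see the theorem in action, here is a minimal NumPy sketch of SGD-SPS on a small interpolating least-squares problem, where $f_i(x_\star) = 0$ so that the step-size can be computed in closed form. The problem instance and all names are illustrative, not taken from the cited papers; note that on this well-conditioned problem the method converges much faster than the worst-case $\mathcal{O}(1/\sqrt{T})$ rate.
</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star  # interpolating least-squares: f_i(x_star) = 0 for every i

def f_i(x, i):
    return 0.5 * (A[i] @ x - b[i]) ** 2

def grad_i(x, i):
    return (A[i] @ x - b[i]) * A[i]

def sgd_sps(x, n_steps):
    x = x.copy()
    for _ in range(n_steps):
        i = rng.integers(n)
        g = grad_i(x, i)
        sq_norm = g @ g
        if sq_norm > 0:  # the step-size is defined to be zero when the gradient vanishes
            # Polyak step-size: (f_i(x) - f_i(x_star))_+ / ||grad f_i(x)||^2,
            # with f_i(x_star) = 0 here thanks to interpolation
            x -= (f_i(x, i) / sq_norm) * g
    return x

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)  # full objective, f(x_star) = 0
x0 = np.zeros(d)
x_out = sgd_sps(x0, 5000)
print(f(x0), f(x_out))  # the suboptimality f(x_out) - f(x_star) shrinks towards 0
```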
<h2>Citing</h2>
<p>
If you find this blog post useful, please consider citing as
</p>
<blockquote>
<a href="http://fa.bianp.net/blog/2023/sps/">Stochastic Polyak Step-size, a simple step-size tuner with optimal rates</a>, Fabian Pedregosa, 2023
</blockquote>
<p>
with bibtex entry:
</p>
<pre>
<code style="width: 100%">
@misc{pedregosa2023sps,
title={Stochastic Polyak Step-size, a simple step-size tuner with optimal rates},
author={Fabian Pedregosa},
howpublished = {\url{http://fa.bianp.net/blog/2023/sps/}},
year={2023}
}
</code>
</pre>
<h3>Acknowledgments</h3>
<p>
Thanks to <a href="https://gowerrobert.github.io/">Robert Gower</a> for first bringing this proof to my attention, and then for answering all my questions about it. Thanks also to <a href="https://vroulet.github.io/">Vincent Roulet</a> for proof reading this post and making many useful suggestions.
</p>
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
On the Convergence of the Unadjusted Langevin Algorithm2023-06-14T00:00:00+02:002023-06-14T00:00:00+02:00<a href='http://fa.bianp.net/pages/about.html'>Fabian Pedregosa</a>tag:fa.bianp.net,2023-06-14:/blog/2023/ulaq/
<p>
The Langevin algorithm is a simple and powerful method to sample from a probability distribution. It's a key ingredient
of some machine learning methods such as diffusion models and differentially private
learning.
In this post, I'll derive a simple convergence analysis of this method in the special case when the …</p>
<p>
The Langevin algorithm is a simple and powerful method to sample from a probability distribution. It's a key ingredient
of some machine learning methods such as diffusion models and differentially private
learning.
In this post, I'll derive a simple convergence analysis of this method in the special case when the target
distribution is a Gaussian distribution.
This analysis will reveal some surprising properties of this algorithm. For example, we'll show that the iterates don't converge to the target distribution, but instead converge to a distribution whose distance to the target distribution is proportional to the step-size.
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js", "color.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@article{ryffel2022differential,
title={Differential Privacy Guarantees for Stochastic Gradient Langevin Dynamics},
url={https://arxiv.org/pdf/2201.11980.pdf},
author={Ryffel, Théo and Bach, Francis and Pointcheval, David},
journal={arXiv preprint},
year={2022}
}
@article{chourasia2021differential,
title={Differential Privacy Dynamics of Langevin Diffusion and Noisy Gradient Descent},
author={Chourasia, Rishav and Ye, Jiayuan and Shokri, Reza},
journal={Advances in Neural Information Processing Systems},
url={https://arxiv.org/pdf/2102.05855.pdf},
year={2021}
}
@article{brown2022does,
title={What Does it Mean for a Language Model to Preserve Privacy?},
author={Brown, Hannah and Lee, Katherine and Mireshghallah, Fatemehsadat and Shokri, Reza and Tramèr, Florian},
journal={arXiv preprint},
year={2022}
}
@article{durmus2019high,
title={High-dimensional Bayesian inference via the unadjusted Langevin algorithm},
author={Durmus, Alain and Moulines, Eric},
journal={Bernoulli},
volume={25},
number={4A},
pages={2854--2882},
year={2019},
url={https://arxiv.org/pdf/1605.01559.pdf},
publisher={Bernoulli Society for Mathematical Statistics and Probability}
}
@article{song2020score,
title={Generative Modeling by Estimating Gradients of the Data Distribution},
author={Song, Yang},
journal={Blog Post},
url={https://yang-song.net/blog/2021/score/},
year={2021}
}
@article{dockhorn2021score,
title={Score-based generative modeling with critically-damped Langevin diffusion},
author={Dockhorn, Tim and Vahdat, Arash and Kreis, Karsten},
journal={ICLR},
url={https://arxiv.org/abs/2112.07068},
year={2022}
}
@article{takatsu2010wasserstein,
title={On Wasserstein geometry of Gaussian measures},
author={Takatsu, Asuka},
journal={Probabilistic approach to geometry},
url={https://arxiv.org/abs/0801.2250},
volume={57},
pages={463--472},
year={2010},
publisher={Mathematical Society of Japan}
}
@article{pedersen1972some,
title={Some operator monotone functions},
author={Pedersen, Gert K},
journal={Proceedings of the American Mathematical Society},
volume={36},
number={1},
pages={309--310},
year={1972},
url={https://www.ams.org/journals/proc/1972-036-01/S0002-9939-1972-0306957-4/S0002-9939-1972-0306957-4.pdf}
}
@article{freund2022convergence,
title={When is the Convergence Time of Langevin Algorithms Dimension Independent? A Composite Optimization Viewpoint},
author={Freund, Yoav and Ma, Yi-An and Zhang, Tong},
journal={Journal of Machine Learning Research},
volume={23},
number={214},
pages={1--32},
year={2022},
url={https://arxiv.org/pdf/2110.01827.pdf}
}
@inproceedings{wibisono2018sampling,
title={Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem},
author={Wibisono, Andre},
booktitle={Conference on Learning Theory},
pages={2093--3027},
year={2018},
organization={PMLR},
url={https://arxiv.org/pdf/1802.08089.pdf}
}
@article{durmus2019analysis,
title={Analysis of Langevin Monte Carlo via convex optimization},
author={Durmus, Alain and Majewski, Szymon and Miasojedow, Blazej},
journal={The Journal of Machine Learning Research},
volume={20},
number={1},
pages={2666--2711},
year={2019},
publisher={JMLR. org},
url={https://www.jmlr.org/papers/volume20/18-173/18-173.pdf}
}
@article{dalalyan2019user,
title={User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient},
author={Dalalyan, Arnak S and Karagulyan, Avetik},
journal={Stochastic Processes and their Applications},
volume={129},
number={12},
pages={5278--5311},
year={2019},
publisher={Elsevier},
url={https://arxiv.org/pdf/1710.00095.pdf}
}
@article{li2021sqrt,
title={Sqrt (d) dimension dependence of langevin monte carlo},
author={Li, Ruilin and Zha, Hongyuan and Tao, Molei},
journal={arXiv preprint},
year={2021},
url={https://arxiv.org/abs/2109.03839}
}
@article{papyan2020traces,
title={Traces of class/cross-class structure pervade deep learning spectra},
author={Papyan, Vardan},
journal={The Journal of Machine Learning Research},
volume={21},
number={1},
pages={10197--10260},
year={2020},
publisher={JMLR},
url={https://jmlr.org/papers/v21/20-933.html}
}
@inproceedings{ghorbani2019investigation,
title={An investigation into neural net optimization via hessian eigenvalue density},
author={Ghorbani, Behrooz and Krishnan, Shankar and Xiao, Ying},
booktitle={International Conference on Machine Learning},
pages={2232--2241},
year={2019},
organization={PMLR},
url={https://arxiv.org/pdf/1901.10159.pdf}
}
@inproceedings{welling2011bayesian,
title={Bayesian learning via stochastic gradient Langevin dynamics},
author={Welling, Max and Teh, Yee Whye},
booktitle={Proceedings of the 28th international conference on machine learning (ICML-11)},
pages={681--688},
url={https://icml.cc/2011/papers/398_icmlpaper.pdf},
year={2011}
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
document.addEventListener('DOMContentLoaded', doTOC, false);
</script>
<div id="TOC"></div>
<div style="display: none">
$$
\require{bbox}
\def\aa{\boldsymbol a}
\def\rr{\boldsymbol r}
\def\AA{\boldsymbol A}
\def\HH{\boldsymbol H}
\def\EE{\mathbb E}
\def\II{\boldsymbol I}
\def\CC{\boldsymbol C}
\def\DD{\boldsymbol D}
\def\KK{\boldsymbol K}
\def\eeps{\boldsymbol \varepsilon}
\def\tr{\text{tr}}
\def\LLambda{\boldsymbol \Lambda}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\qq{\boldsymbol q}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\lmax{L}
\def\lmin{\mu}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\QQ{\boldsymbol Q}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\DeclareMathOperator{\span}{\mathbf{span}}
\DeclareMathOperator*{\diag}{\mathrm{diag}}
\def\defas{\stackrel{\text{def}}{=}}
\def\dif{\mathop{}\!\mathrm{d}}
\definecolor{colorepsdp}{RGB}{55,126,184}
\def\epsdp{{\color{colorepsdp}{\boldsymbol\varepsilon}}}
\definecolor{color1}{RGB}{27,158,119}
\definecolor{color2}{RGB}{217,95,2}
\definecolor{color3}{RGB}{253,205,172}
\definecolor{color4}{RGB}{203,213,232}
\def\step{{\color{color2}\gamma}}
\def\probaone{p}
\def\probatwo{q}
\definecolor{colorbias}{RGB}{117,112,179}
$$
</div>
<h2>The Unadjusted Langevin Algorithm (ULA)</h2>
<p>
The Langevin algorithm –also known as Langevin Monte Carlo and Langevin MCMC– is an algorithm for generating samples from a
probability distribution for which we have access to the gradient of its log probability. It's behind some of the recent success of generative AI such as diffusion models<dt-cite key="song2020score"></dt-cite> <dt-cite key="dockhorn2021score"></dt-cite> where generating a new sample is an instance of this algorithm on a score-based model. It's also used in the context of
differential privacy, where it's used to train models with privacy guarantees.<dt-cite key="chourasia2021differential"></dt-cite> <dt-cite key="ryffel2022differential"></dt-cite>
</p>
<p>
Given a function $f: \RR^d \to \RR$ for which we have access to its gradient $\nabla f$, and for which $\int \exp(-f(x)) \dif x$ is finite, the Langevin algorithm produces a sequence of random iterates $x_0 \sim p_0, x_1 \sim p_1, \ldots$ whose distributions increasingly approximate the following target distribution:
\begin{equation*}
q(x) \defas \frac{1}{Z} \exp \left( -f(x) \right)\,, \quad \text{ with } Z
\defas \int_{\RR^d} \exp(-f(x)) \dif x\,.
\end{equation*}
</p>
<figure>
<span class="marginnote">At each iteration, ULA produces a sample whose distribution increasingly approximates the
target distribution. The left plot shows the iterative nature of ULA for a 1-dimensional distribution.
<span style="color: #1b9e77">In <b>green</b></span>, the probability density function associated with the iterates
${\color{color1}x_1 \sim p_1, x_2 \sim p_2, \ldots}$ and <span style="color: #D95F02">in <b>orange</b></span> the
target distribution ${\color{color2}\boldsymbol{q}}$. Note how the iterate's distribution approaches the target
distribution.
</span>
<img src="/images/2023/langevin_animation.png" alt="">
</figure>
<p>
And it does so in a remarkably simple way:<dt-note>Throughout this blog post we'll use the following notation. $I$ denotes the identity
matrix, $\mathcal{N}(\mu, \Sigma)$ denotes a multi-dimensional Gaussian distribution with mean $\mu$ and
covariance $\Sigma$. $\|\cdot\|$ denotes the euclidean distance on vectors while $\|\cdot\|_F$ denotes the
Frobenius norm on matrices.</dt-note>
</p>
<p class="framed">
<b class="tufte-underline">Unadjusted Langevin Algorithm (ULA)</b><br>
<b>Input</b>: starting guess $x_0 \in \RR^d$ and step-size $\step \gt 0$. <br>
<b>For</b> $t=0, 1, \ldots$ <br>
\begin{equation}\label{eq:ula}
\begin{aligned}
& \text{sample } {\color{colorepsdp}\boldsymbol{\varepsilon}_t} \sim \mathcal{N}(0, I)\\
&x_{t+1} = x_t - \step \nabla f(x_t) + \sqrt{2 \step} {\color{colorepsdp}\boldsymbol{\varepsilon}_t}
\end{aligned}
\end{equation}
</p>
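<p>
In code, one ULA step is a one-liner. The sketch below (the Gaussian target and all names are illustrative) runs the recursion \eqref{eq:ula} for many independent chains in parallel and checks that their sample mean approaches the target mean:
</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Gaussian target: q(x) proportional to exp(-f(x)) with
# f(x) = (1/2) (x - mu)^T H (x - mu), so q = N(mu, H^{-1})
mu = np.array([1.0, -2.0])
H = np.array([[2.0, 0.5], [0.5, 1.0]])

def grad_f(x):
    # gradient of f; works on a batch of chains (rows) since H is symmetric
    return (x - mu) @ H

def ula(x, step, n_steps):
    for _ in range(n_steps):
        eps = rng.standard_normal(x.shape)              # eps_t ~ N(0, I)
        x = x - step * grad_f(x) + np.sqrt(2 * step) * eps
    return x

# run 2000 independent chains in parallel and look at the sample mean
x0 = np.zeros((2000, 2))
samples = ula(x0, step=0.05, n_steps=500)
print(samples.mean(axis=0))  # close to mu = [1, -2]
```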
<p>
The algorithm is sometimes also referred to as noisy gradient descent, because it's just that: gradient descent with some Gaussian noise added to the gradient at each iteration.
Although we won't use this perspective in this blog post, it also corresponds to the Euler-Maruyama discretization of a <a href="https://en.wikipedia.org/wiki/Langevin_equation">Langevin diffusion</a> process (hence the name).<dt-note> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Paul_Langevin_Wellcome2.jpg/186px-Paul_Langevin_Wellcome2.jpg" alt="Paul Langevin" style="display: block; margin: 0 auto; max-width: 200px; box-shadow: 6px 6px 3px grey;"> <br> <a href="https://en.wikipedia.org/wiki/Paul_Langevin">Paul Langevin</a> (1872 - 1946) was a French physicist who developed Langevin dynamics and the Langevin equation. He was a doctoral student of Pierre Curie and later a lover of widowed Marie Curie. He is also known for his two US patents with Constantin Chilowsky in 1916 and 1917 involving ultrasonic submarine detection. He is entombed at the Panthéon.</dt-note>
</p>
<p>
The Langevin algorithm admits different variants, depending on whether the step-size is constant or decreasing and
whether there's a rejection step or not.
The variant above with a constant step-size $\step$ and no rejection step is the most commonly used in practice and is often referred to as the <span class="tufte-underline">unadjusted</span> Langevin algorithm (ULA). Although I won't cover them in this post, it's worth noting that there are also <i>stochastic</i> variants of the algorithm, where $\nabla f(x_t)$ is replaced with a stochastic unbiased estimator.<dt-cite key="welling2011bayesian"></dt-cite>
</p>
<p>
The goal of this post is to characterize the speed of convergence of the iterates towards this target distribution. We'll identify the key properties that influence this convergence, and see a few surprises along the way.
</p>
<h2>The Difficulty of Analyzing the Langevin Algorithm</h2>
<p>
One of the key differences between the analysis of optimization and that of sampling algorithms is that in sampling
algorithms, we're not looking to bound the distance between the iterates and the optimal solution. Instead, we're
looking to bound the distance between the iterates' <i>distribution</i> and the target distribution. This has far-reaching consequences. For one, distances between distributions are
often difficult to compute. As an example, the squared Wasserstein distance $W_2(\cdot, \cdot)^2$ (also known as the Kantorovich-Rubinstein distance) between probability distributions
$\probaone$ and $\probatwo$ is defined as
\begin{equation}
W_2(\probaone, \probatwo)^2 \defas \inf_{\pi \in \Pi(\probaone, \probatwo)} \EE_{(x,
y) \sim \pi} {\|x - y\|}^2\,,
\end{equation}
where $\Pi(\probaone, \probatwo)$ is the set of all couplings between $\probaone$ and $\probatwo$.<dt-note>A coupling between two probability distributions $p$ and $q$ is a third distribution $\pi$ in the product space that has $p$ and $q$ as marginals. </dt-note> Computing the Wasserstein hence involves solving an optimization problem over the set
of couplings.
</p>
<p>
Fortunately, computing the Wasserstein distance of Gaussian distributions admits an explicit formula. When
$\probaone$ and $\probatwo$ are both Gaussian with mean $\mu_p$,
$\mu_q$ and covariance $\Sigma_p$,
$\Sigma_q$ respectively, the Wasserstein distance between them is given by<dt-cite key="takatsu2010wasserstein"></dt-cite>
\begin{equation*}
W_2(\probaone, \probatwo)^2 = \| \mu_p -
\mu_q \|^2 + \tr\left(\Sigma_p +
\Sigma_q - 2 \bigl(\Sigma_q^{1/2} \Sigma_p
\Sigma_q^{1/2}\bigr)^{1/2}\right)\,.
\end{equation*}
Furthermore, when the covariance matrices commute then the previous formula simplifies further to
\begin{equation}\label{eq:wasserstein_simple}
W_2(\probaone, \probatwo)^2 = \|\mu_p - \mu_q\|^2 +
\|\Sigma_p^{1/2} - \Sigma_q^{1/2}\|_{F}^2\,.
\end{equation}
</p>
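<p>
Formula \eqref{eq:wasserstein_simple} is straightforward to evaluate numerically. The helper below (names are illustrative) computes the matrix square roots through an eigendecomposition, which is exact for symmetric positive semi-definite matrices:
</p>

```python
import numpy as np

def sqrtm_psd(S):
    # matrix square root of a symmetric PSD matrix via its eigendecomposition
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def w2_squared_commuting(mu_p, S_p, mu_q, S_q):
    # squared W2 distance between N(mu_p, S_p) and N(mu_q, S_q),
    # valid when the covariances commute: S_p S_q = S_q S_p
    diff = sqrtm_psd(S_p) - sqrtm_psd(S_q)
    return np.sum((mu_p - mu_q) ** 2) + np.sum(diff ** 2)

# sanity check with a diagonal (hence commuting) covariance:
# same covariance, so only the mean term contributes
S = np.diag([1.0, 4.0])
d2 = w2_squared_commuting(np.zeros(2), S, np.array([3.0, 0.0]), S)
print(d2)  # -> 9.0
```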
<h2>It's Gaussians all the Way</h2>
<figure>
<img src="/images/2023/gauss_multiple.png" alt="">
</figure>
<p>
Throughout the rest of the post we'll assume that $f$ is a quadratic function of the form:
\begin{equation}\label{eq:quadratic_function}
f(x) \defas \frac{1}{2}(x - {\mu}_q)^\top H (x - {\mu}_q)\,,
\end{equation}
with a positive definite matrix $H$. The associated target distribution $q(x) \propto \exp \left( -f(x) \right)$ is then a
Gaussian distribution with mean ${\mu}_q$ and covariance $H^{-1}$.
</p>
<p>
Similar to the analysis of optimization methods <a href="/blog/2020/polyopt/">on quadratics</a>, the analysis of ULA simplifies considerably when the target measure is Gaussian, as in this case each step of the algorithm is equivalent to a Gaussian random walk.
Assuming that $x_0$ is sampled from a Gaussian distribution, the first iterate is then a
linear combination of two Gaussian random variables, which is again a Gaussian random variable. For the same reason, every subsequent iterate is also a Gaussian random variable.
</p>
<h2>Unroll all the things!</h2>
<p>
One advantage of working with Gaussian target measures is that we can write down any iterate in a simple non-recursive formula. In this case the gradient $\nabla f(x) = H (x - \mu_q)$ is an affine function of the parameters, which allows us to <q>unroll</q> the iterates as
\begin{align}
x_{t} - \mu_q &= (I - \step H) (x_{t-1} - \mu_q) + \sqrt{2 \step}
{\color{colorepsdp}\boldsymbol{\varepsilon}_{t-1}}\\
&= (I - \step H) \left( (I - \step H) (x_{t-2} - \mu_q) + \sqrt{2 \step}
{\color{colorepsdp}\boldsymbol{\varepsilon}_{t-2}}\right) + \sqrt{2 \step}
{\color{colorepsdp}\boldsymbol{\varepsilon}_{t-1}}\\
&= (I - \step H)^2 (x_{t-2} - \mu_q) + \sqrt{2 \step} (I - \step
H){\color{colorepsdp}\boldsymbol{\varepsilon}_{t-2}} + \sqrt{2 \step}
{\color{colorepsdp}\boldsymbol{\varepsilon}_{t-1}} \\
&= \cdots\\
&= (I - \step H)^t (x_0 - \mu_q) + \sqrt{ 2 \step} \sum_{i=0}^{t-1} (I - \step H)^{t - 1 - i}
{\color{colorepsdp}\boldsymbol{\varepsilon}_{i}}\,, \label{eq:last_eq_unrolling}
\end{align}
where in the first line we've used the definition of $x_t$, in the second one that of $x_{t-1}$ and so on.
</p>
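<p>
The unrolled formula \eqref{eq:last_eq_unrolling} can be checked numerically by feeding the same noise sequence to both the recursion and the closed form. A small sketch with an arbitrary quadratic (all names and constants are illustrative):
</p>

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, step = 3, 20, 0.1
H = np.diag([0.5, 1.0, 2.0])       # illustrative quadratic
mu_q = rng.standard_normal(d)
x0 = rng.standard_normal(d)
eps = rng.standard_normal((T, d))  # one shared noise sequence for both computations

# run the ULA recursion
x = x0.copy()
for t in range(T):
    x = x - step * H @ (x - mu_q) + np.sqrt(2 * step) * eps[t]

# evaluate the unrolled closed form
M = np.eye(d) - step * H
x_unrolled = (
    mu_q
    + np.linalg.matrix_power(M, T) @ (x0 - mu_q)
    + np.sqrt(2 * step) * sum(np.linalg.matrix_power(M, T - 1 - i) @ eps[i] for i in range(T))
)
print(np.allclose(x, x_unrolled))  # -> True
```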
<p>
We now want to characterize the distribution of the iterates. Thanks to the previous section, we know that this is a
Gaussian distribution, so it is fully determined through its mean $\mu_t$ and covariance
$\Sigma_t$. Let's estimate these two quantities separately. For simplicity, we'll assume that the initial iterate was sampled from a Gaussian distribution with scaled identity covariance, that is, $p_0 = \mathcal{N}(\mu_0, \sigma^2 I)$ for some scalar $\sigma \geq 0$.
<dt-note>This assumption can be trivially relaxed to a covariance matrix that commutes with $H$.
However, allowing for a general initial covariance matrix becomes more complicated, as the terms in the equation \eqref{eq:limit_covariance} no longer commute, resulting in very long expressions. In any case, most often the initial guess generated either deterministically (covered by our assumptions with $\sigma = 0$) or a standard Gaussian distribution (also covered).
</p>
<p>
<span class="tufte-underline">1️⃣: mean.</span>
Taking expectations on \eqref{eq:last_eq_unrolling}, all the terms in ${\color{colorepsdp}\boldsymbol{\varepsilon}_{i}}$ vanish, and we are left with
\begin{equation}\label{eq:mean_t}
\mu_t - \mu_q \stackrel{\eqref{eq:last_eq_unrolling}}{=} (I - \step H)^t (\mu_0 - \mu_q)\,.
\end{equation}
As long as the spectral radius of $I - \step H$ is smaller than $1$, the right-hand side vanishes exponentially fast. Hence the formula above tells us that the iterates' mean $\mu_t$ converges exponentially fast to the target mean $\mu_q$.
</p>
<p>
<span class="tufte-underline">2️⃣: covariance.</span> This one's more difficult. Using the shorthand notation $x^2 = x x^\top$ for a vector $x$, we have
\begin{align}
\Sigma_t &\defas \EE (x_t - \mu_t)^2 = \EE\biggl((I - \step H)^t (x_0 - \mu_q) + \sqrt{ 2 \step} \sum_{i=0}^{t-1} (I - \step H)^{t-1-i}
{\color{colorepsdp}\boldsymbol{\varepsilon}_{i}} - (\mu_t - \mu_q)\biggr)^2 \\
&\stackrel{\eqref{eq:mean_t}}{=} \EE\left((I - \step H)^t (x_0 - \mu_0) + \sqrt{ 2 \step} \sum_{i=0}^{t-1} (I - \step H)^{t-1-i}
{\color{colorepsdp}\boldsymbol{\varepsilon}_{i}} \right)^2 \\
&= 2 \step \sum_{i=0}^{t-1} (I - \step H)^{2i} + \sigma^2 (I - \step H)^{2t}\\
&= \underbrace{( H - \frac{\step}{2} H^2)^{-1}}_{\text{biased covariance}} + \underbrace{(I - \step H)^{2t} [\sigma^2 I - ( H - \frac{\step}{2} H^2)^{-1}]}_{\text{vanishing as $t \to \infty$}} \,, \label{eq:limit_covariance}
\end{align}
where in the third line we've used the fact that the $\boldsymbol{\varepsilon}_i$ are independent, and so the terms $\EE[{\color{colorepsdp}\boldsymbol{\varepsilon}_{i}} {\color{colorepsdp}\boldsymbol{\varepsilon}_{j}}^\top]$ vanish for $i \neq j$. In the last line we've used the formula for the partial sum of a geometric series $\sum_{i=0}^{t-1} A^i = (I - A)^{-1} - A^t (I - A)^{-1}$, with $A = (I - \step H)^2$.
</p>
<p>
As $t \to \infty$, the second term vanishes and only the first one survives. Interestingly, this last term is <i>not</i> the covariance matrix of the stationary distribution $H^{-1}$, as one would expect from a converging algorithm. Instead, the iterates' covariance converges towards $( H - \frac{\step}{2} H^2)^{-1}$, which only equals the stationary covariance $H^{-1}$ in the limit as $\step \to 0$. Because of this mismatch, we say that <span class="tufte-underline">ULA is a biased algorithm</span>.
</p>
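<p>
This bias is easy to observe numerically: running many parallel ULA chains on a Gaussian target, the empirical covariance matches $(H - \frac{\step}{2} H^2)^{-1}$ rather than $H^{-1}$. A sketch with an illustrative problem instance and a deliberately large step-size:
</p>

```python
import numpy as np

rng = np.random.default_rng(0)
step = 0.4                   # deliberately large step-size to make the bias visible
H = np.diag([1.0, 2.0])      # target covariance is H^{-1} = diag(1.0, 0.5)

# many parallel chains, run long enough to forget the initialization
x = rng.standard_normal((100_000, 2))   # sigma = 1 initial covariance
for _ in range(300):
    eps = rng.standard_normal(x.shape)
    x = x - step * (x @ H) + np.sqrt(2 * step) * eps  # mu_q = 0, H symmetric

emp_cov = np.cov(x.T)
biased_cov = np.linalg.inv(H - 0.5 * step * H @ H)  # predicted limit: diag(1.25, 0.833)
print(np.diag(emp_cov))  # close to [1.25, 0.833], clearly not [1.0, 0.5]
```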
<h2>Convergence Rate to the Biased Limit</h2>
<p>
In the last section we've seen that the iterates of ULA converge to a Gaussian distribution with mean $\mu_q$ and covariance $( H - \frac{\step}{2} H^2)^{-1}$. We'll now quantify the speed of this convergence.
</p>
<p>
As is customary for optimization results, this convergence rate will depend on the extremal eigenvalues of the loss's Hessian (which in this case is also the target distribution's precision matrix). Let $\ell$ and $L$ denote a lower and upper bound on $H$'s eigenvalues respectively. With this, we have the following result.
</p>
<p class="lemma framed" text="convergence rate on the biased limit">
Let $p_t$ denote the distribution of the iterates of ULA with step-size $\step$ on a quadratic objective function, where the initial guess $p_0$ is a Gaussian distribution with covariance $\sigma^2 I$, $\sigma \leq L(1 + \frac{\step}{2}L)$. <br><br> Then, ULA converges exponentially fast in the Wasserstein distance towards $p_{\step}$, a Gaussian distribution with mean $\mu_q$ and covariance $( H - \frac{\step}{2} H^2)^{-1}$. More precisely, for any $\step \lt 2/L$ we have
\begin{align}
W_2(p_t, p_{\step}) &\leq \left(\max\{|1 - \step L|, |1 - \step \ell|\}\right)^{t} W_2(p_0, p_{\step})\,. \label{eq:convergence_rate}
\end{align}
</p>
<div class="wrap-collabsible"> <input id="collapsible3" class="toggle" type="checkbox"> <label for="collapsible3" class="lbl-toggle" tabindex="0"><b>Show proof</b></label><div class="collapsible-content"><div class="content-inner"><div class="proof" id="proof-variance">
<p>
From the formula for the Wasserstein distance between Gaussian distributions \eqref{eq:wasserstein_simple}, the total distance decomposes into the sum of the distance of the means and that of its square root covariance. We'll estimate these two terms separately.
</p>
<p>
<span class="tufte-underline">1️⃣: distance between means.</span> For this distance we have
\begin{align}
\|\mu_t - \mu_q\|^2 &\stackrel{\eqref{eq:mean_t}}{=} \|(I - \step H)^t (\mu_0 - \mu_q) \|^2 \\
&\leq \max\{|1 - \step L|, |1 - \step \ell|\}^{2t} \|\mu_0 - \mu_q\|^2 \\
\end{align}
where the last line follows by Cauchy-Schwartz.
</p>
<p>
<span class="tufte-underline">2️⃣: distance between covariances.</span>
Let $Q \diag(h_1, \ldots, h_d) Q^\top$ be the eigendecomposition of $H$, where $h_1, h_2, \ldots$ are the eigenvalues of $H$ and $Q$ is an orthonormal matrix. We'll first show that both $(H - \frac{\step}{2} H^2)^{-1}$ and $\Sigma_t$ are diagonal in the same basis as this will simplify the computation of the Frobenius norm of their difference.
</p>
<p>
Replacing $H$ by its eigendecomposition in the formula of the biased covariance, we have that this last one admits the following eigendecomposition:
\begin{align}
(H - \frac{\step}{2} H^2)^{-1} &= Q \diag(\frac{1}{h^\step_1}, \ldots, \frac{1}{h^\step_d}) Q^\top \,,
\end{align}
with $h_i^\step \defas h_i - \frac{\step}{2}h_i^2$.
Also replacing $H$ by its eigendecomposition in \eqref{eq:limit_covariance}, we obtain that $\Sigma_t$ admits the eigendecomposition
\begin{align}
\Sigma_t = Q \bigl(\diag(\frac{1}{h^\step_1} - z_1, \ldots, \frac{1}{h^\step_d} - z_d)\bigr)Q^\top \,,
\end{align}
with $z_i \defas (1 - \step h_i)^{2 t}(\frac{1}{h_i^\step} - \sigma^2)$.
</p>
<p>
Now using the fact that the squared Frobenius norm of a matrix is the sum of its eigenvalues, we can write the distance between the two covariances in terms of $h_i^\step$ and $z_i$:
\begin{align}
&\|\Sigma_t^{1/2} - ( H - \frac{\step}{2} H^2)^{-1/2}\|_F^2 \nonumber\\
&\quad= \sum_{i=1}^d \biggl(\sqrt{\frac{1}{h^\step_i} - z_i} - \sqrt{\frac{1}{h^{\step}_i}}\biggr)^2 \\
&\quad= \sum_{i=1}^d \frac{1}{h^\step_i}\biggl(2 - h_i^\step z_i - 2 \sqrt{1 - h_i^\step z_i} \biggr) \label{eq:expanding_square}\\
&\quad\leq \sum_{i=1}^d \frac{1}{h^\step_i}\biggl(2 - h_i^\step z_i - 2 (1 - h_i^\step z_i) \biggr) \label{eq:sqrt_inequality}\\
&\quad= \sum_{i=1}^d z_i = \sum_{i=1}^d (1 - \step h_i)^{2 t}(\frac{1}{h^\step_i} - \sigma^2)\\
&\quad\leq \max\{|1 - \step L|, |1 - \step \ell|\}^{2t} \underbrace{\sum_{i=1}^d (\frac{1}{h^\step_i} - \sigma^2)}_{= \|(H - \frac{\step}{2} H^2)^{-1} - \sigma^2 I\|_F^2}\,,
\end{align}
where in \eqref{eq:expanding_square} we have expanded the square and in \eqref{eq:sqrt_inequality} we have used the inequality $\sqrt{1 - x} \geq 1 - x$ for $x \in [0, 1]$, together with the assumption $\sigma^2 \leq 1/h^\step_i$, which guarantees $h_i^\step z_i \in [0, 1]$.
</p>
<p>
Finally, we sum the bound in the distance between means and the bound in the distance between covariances to obtain
\begin{equation}
W_2(p_0, p_{\step})^2 \leq \max\{|1 - \step L|, |1 - \step \ell|\}^{2t} \underbrace{\bigl(\|\mu_0 - \mu_q\|^2 + \|(H - \frac{\step}{2} H^2)^{-1} - \sigma^2 I\|_F^2\bigl)}_{=W_2(p_0, p_{\step})^2}\,.
\end{equation}
The final result then comes from taking the square root on both sides.
</p>
</div></div></div></div>
<h2>Convergence Rate to the Stationary Distribution</h2>
<p>
The rate in the last lemma unfortunately only bounds the distance to the <i>biased limit</i>. However, we'll usually want instead the distance to the target distribution, since that's the distribution we want to sample from. The Wasserstein distance (like any distance) satisfies the triangle inequality, which we can use to bound the distance to the target distribution:
\begin{equation}
W_2(p_t, q) \leq W_2(p_t, p_{\step}) + W_2(p_{\step}, q)\,.
\end{equation}
Of the two terms on the right-hand side, the first one is already bounded by the previous lemma. The second one is the distance between the biased limit and the target distribution. By bounding this last term we achieve our goal, which was to bound the distance between $p_t$ and the target distribution.
As in the previous Lemma, we denote by $\ell$ and $L$ a lower and upper bound on $H$'s eigenvalues respectively.
</p>
<p class="theorem framed" id="cur">
Let $p_t$ denote the distribution of the iterates of ULA with step-size $\step$ on a quadratic objective function, where the initial guess $p_0$ is a Gaussian distribution with covariance $\sigma^2 I$, $\sigma \leq L(1 + \frac{\step}{2}L)$. <br><br> Then, for any step-size $\step \lt 2/L$, the Wasserstein distance between $p_t$ and the target distribution $q$ can be bounded by a sum of two terms, of which the first one vanishes exponentially fast in $t$, while the second one is $\mathcal{O}(\step)$ close to the target distribution. More precisely, at every iteration $t$ we have
\begin{equation}
W_2(\probaone_t, \probatwo) \leq \underbrace{\rho^{t}\, W_2(p_0, p_{\step})\vphantom{\frac{1}{2}}}_{\text{exponential convergence}} + \underbrace{\frac{\step}{4}\sqrt{\tr(H)}}_{\text{stationary}} \,,
\end{equation}
with linear rate factor $\rho \defas \max\{|1 - \step L|,|1 - \step \ell|\}$.
</p>
<div class="wrap-collabsible"> <input id="collapsible4" class="toggle" type="checkbox"> <label for="collapsible4" class="lbl-toggle" tabindex="0"><b>Show proof</b></label><div class="collapsible-content"><div class="content-inner"><div class="proof" id="proof-full-rate">
<p>
As in the proof of the previous lemma, we denote the eigenvalues of $H$ by $h_1, h_2, \ldots$.
</p>
<p>
We'll now bound the distance between $p_{\step}$ and the target distribution $q$. Both distributions are Gaussians with mean $\mu_q$, so their Wasserstein distance is the Frobenius norm of the difference of their square root covariances \eqref{eq:wasserstein_simple}. Then we have:
\begin{align}
W_2(q, p_{\step})^2 &= \|H^{-1/2} - ( H - \frac{\step}{2} H^2)^{-1/2}\|_F^2 \\
&= \sum_{i=1}^d \frac{1}{h_i}\biggl(1-\frac{1}{\sqrt{1 + \frac{1}{2} \step h_{i}}}\biggl)^2\,.\label{eq:wasserstein_bias_eigvals}
\end{align}
Let's now consider the function $\varphi(z) = (1 - \frac{1}{\sqrt{1 + z}})^2$ for $z$ in the $[0, 1]$ interval.<dt-note>This corresponds to the expression inside the parenthesis above as a function of $z = \step h_i$. Note that the assumption $\step \lt 2 / L$ implies $\frac{1}{2}\step h_i \leq 1$ for all $h_i$.</dt-note> From a first-order Taylor expansion with remainder we obtain the following equivalent expression for this function:
\begin{equation}
\varphi(z) \defas (1 - \frac{1}{\sqrt{1 + z}})^2 = \underbrace{\varphi(0)}_{=0} + \underbrace{\varphi'(0)}_{=0} z + \varphi''(\xi) \frac{z^2}{2}
\end{equation}
for some $\xi \in [0, 1]$. At the same time, we have that $\varphi''(z) = \frac{4-3 \sqrt{z+1}}{2 (z+1)^3}$ is a monotonically decreasing function, so we can upper bound it through $\varphi''(0) = \frac{1}{2}$.<dt-note>Through a second-order Taylor expansion we have upper bounded the function $\varphi(z) = (1 - \frac{1}{\sqrt{1 + z}})^2$ (blue) with $\frac{z^2}{4}$ (orange) <br><img src="/images/2023/comparison_taylor_expansion0.png" alt=""></dt-note> Therefore, we have that
\begin{equation}
\varphi(z) \leq \frac{z^2}{4} \quad \text{for all } z \in [0, 1]\,.
\end{equation}
In particular, this holds for $z = \frac{1}{2} \step h_i$, where we have
\begin{equation}
\varphi(\frac{1}{2} \step h_i) = \bigl(1-\frac{1}{\sqrt{1 + \frac{1}{2} \step h_{i}}}\bigl)^2 \leq \frac{\step^2 h_i^2}{16} \,.\label{eq:bound_varphi}
\end{equation}
</p>
<p>
We can now use this bound on $\varphi$ to upper bound the terms inside the parenthesis of \eqref{eq:wasserstein_bias_eigvals} as follows:
\begin{align}
W_2(q, p_{\step})^2 &\stackrel{\eqref{eq:wasserstein_bias_eigvals}}{=} \sum_{i=1}^d \frac{1}{h_i}\biggl(1-\frac{1}{\sqrt{1 + \frac{1}{2} \step h_{i}}}\biggl)^2 \\
&\stackrel{\eqref{eq:bound_varphi}}{\leq} \frac{\step^2}{16} \underbrace{\sum_{i=1}^d h_i}_{= \tr(H)}
\end{align}
</p>
<p>
Finally, taking the square root of the bound above and combining it with the previous lemma we get the desired result.
</p>
</div></div></div></div>
<p>
The result above shows how the convergence rate of Langevin can be split into a sum of two terms: the first converges exponentially fast, while the other is stationary. Although this fact is widely known, the result above seems to be somewhat new in that it doesn't have higher-order terms in $\step$, unlike that of (Wibisono 2018).<dt-cite key="wibisono2018sampling"></dt-cite>
</p>
<p>
The formula above establishes convergence for any step-size $\step \lt 2/L$. However, for all but the largest step-sizes it can be further simplified. For example, for $\step \leq 2 / (L + \ell)$ we have $|1 - \step \ell| \geq |1 - \step L|$. In this case, the $\max$ in the Theorem simplifies to $ \max\{|1 - \step L|,|1 - \step \ell|\} = 1 - \step \ell$ <dt-note>For small step-sizes, the maximum between $|1 - \step L|$ and $|1 - \ell \step|$ is achieved for $|1 - \ell \step|$. Example below of these two functions with $L=1, \ell=0.1$. <br> <img src="/images/2023/rate_factor_comparison.png" alt=""></dt-note> and we can simplify the rate as per the following Corollary.
</p>
<p class="corollary">
Under the same assumptions of the theorem above, for any step-size $\step \leq 2 / (L + \ell)$ we have
\begin{equation}
W_2(\probaone_t, \probatwo) \leq \vphantom{\frac{1}{2}}(1 - \step \ell)^{t} W_2(p_0, p_{\step}) + \frac{\step}{4}\sqrt{\tr(H)} \,.
\end{equation}
</p>
<p>
The formula above shows how the convergence of the Langevin algorithm can be split into the sum of two terms, one of
which is exponentially decreasing, while the other is a non-zero bias term. In particular, this shows that <span class="uderline">ULA is a biased algorithm</span>, even in the case of Gaussian target distributions.
</p>
<p>
The bias term is proportional to the step-size.
In particular, the bias vanishes when the step-size is zero but is
positive otherwise. This is one defining difference between gradient descent and the ULA algorithm: in gradient
descent, a non-zero step-size still takes you to the right minimizer, while ULA will be biased. To eliminate the bias, one can use a decreasing step-size or change the algorithm to include a rejection step –the so-called <a href="https://en.wikipedia.org/wiki/Metropolis-adjusted_Langevin_algorithm">Metropolis-adjusted Langevin algorithm (MALA)</a>.
</p>
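<p>
To see this bias concretely, here's a quick simulation sketch (my own, not code from the original post) of ULA on the one-dimensional quadratic $f(x) = \frac{h}{2}x^2$, whose target is $\mathcal{N}(0, 1/h)$. The empirical variance of the iterates approaches the biased value $(h - \frac{\step}{2}h^2)^{-1}$ from the analysis above, rather than the target variance $1/h$:
</p>

```python
import numpy as np

# ULA on f(x) = (h/2) x^2:  x_{t+1} = (1 - gamma*h) x_t + sqrt(2*gamma) xi_t.
# Target distribution: N(0, 1/h). Biased limit variance: 1/(h - (gamma/2) h^2).
h, gamma = 1.0, 0.5
rng = np.random.default_rng(0)
noise = np.sqrt(2 * gamma) * rng.standard_normal(201_000)
x, samples = 0.0, []
for t, xi in enumerate(noise):
    x = (1 - gamma * h) * x + xi
    if t >= 1_000:                       # discard burn-in
        samples.append(x)
empirical_var = np.var(samples)
biased_var = 1 / (h - gamma / 2 * h**2)  # = 4/3 for these values
print(empirical_var, biased_var, 1 / h)
```

<p>
With $h=1$ and $\step=0.5$ the biased variance is $4/3$, clearly separated from the target variance $1$; the gap shrinks proportionally to the step-size, in line with the $\mathcal{O}(\step)$ bias term.
</p>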
<h2>To know more</h2>
<p>
The literature on the analysis of ULA is vast, so I won't cover it all but merely point to some of my favorite papers. For any notable omission, please leave a comment below!
</p>
<p>
Early works that derive convergence rates for ULA on convex and smooth functions (that is, more general than the quadratic setting of this post) include those of Durmus et al.<dt-cite key="durmus2019analysis"></dt-cite> and Dalalyan et al.<dt-cite key="dalalyan2019user"></dt-cite>
</p>
<p>
A very insightful work that already derived an asymptotic $\frac{\step}{4}\sqrt{\tr(H)} + \mathcal{O}(\step^2)$ bias of Langevin on quadratics is that of Wibisono.<dt-cite key="wibisono2018sampling"></dt-cite>
The Theorem above improves upon Wibisono's bounds by avoiding higher-order terms in $\step$ in the rate and by providing non-asymptotic rates.
</p>
<p>
Another aspect worth mentioning is that most analyses of ULA bound the bias in terms of the problem dimensionality,<dt-cite key="li2021sqrt"></dt-cite> while we bound it in terms of the trace of the Hessian. In the worst-case, the square root of the trace is $\sqrt{L d}$, but in many cases it is much smaller. For example, it's known that for large machine learning models most of the eigenvalues are close to zero.<dt-cite key="ghorbani2019investigation"></dt-cite> <dt-cite key="papyan2020traces"></dt-cite>
As far as I know, the only work that analyzes ULA for general smooth and convex functions and has a bias term that depends on the trace of the Hessian<dt-note>Their rate depends on $\tr(H^2)$ instead of $\sqrt{\tr(H)}$ as in the rates above.</dt-note> is Theorem 6 in (Freund et al. 2022).<dt-cite key="freund2022convergence"></dt-cite> However, their bound is on the KL divergence (instead of the Wasserstein distance), so the two are not directly comparable. I believe it's still an open problem whether it's possible to carry out such an analysis for the Wasserstein distance.
</p>
<h2>Citing</h2>
<p>
</p>
<h3>Acknowledgements</h3>
<p>
Thanks to <a href="https://scholar.google.com/citations?user=93PAG2AAAAAJ&hl=en">Baptiste Goujaud</a> for reporting numerous typos and to <a href="https://cypaquette.github.io/">Courtney Paquette</a>, <a href="https://scholar.google.com/citations?user=-tEiRFcAAAAJ&hl=en">James Harrison</a> and <a href="https://sites.google.com/corp/view/sp-monte-carlo/">Sam Power</a> for feedback on the blog post.
</p>
<br><br>
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
The Russian Roulette: An Unbiased Estimator of the Limit, by Fabian Pedregosa (2022-10-15, fa.bianp.net/blog/2022/russian-roulette/)
<blockquote class="pullquote" style="margin-left: 20px">
<p>
<q>The idea for what was later called Monte Carlo method occurred to me when I was playing solitaire during my illness.</q>
</p>
<p style="text-align: right;">
Stanislaw Ulam, <i><a href="https://www.goodreads.com/book/show/423246.Adventures_of_a_Mathematician">Adventures of a Mathematician</a></i>
</p>
</blockquote>
<p>
The Russian Roulette offers a simple way to construct an unbiased estimator for the limit of a sequence. It allows for example to construct an unbiased estimator of the pseudoinverse of a matrix, which is otherwise difficult to obtain. We'll first show that the estimator is unbiased. Then we'll discuss one of the original applications of this method: an unbiased estimator of the matrix pseudoinverse. Finally, we'll discuss its limitations and practical issues through a variance analysis.
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js", "color.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@article{kahn1955use,
title={Use of different Monte Carlo sampling techniques},
author={Kahn, Herman},
year={1955},
url={https://www.rand.org/pubs/papers/P766.html},
publisher={Rand Corporation},
journal={Rand Corporation}
}
@article{forsythe1950matrix,
title={Matrix inversion by a Monte Carlo method},
author={Forsythe, George E and Leibler, Richard A},
journal={Mathematics of Computation},
url= {https://www.ams.org/journals/mcom/1950-04-031/S0025-5718-1950-0038138-X/S0025-5718-1950-0038138-X.pdf},
volume={4},
number={31},
pages={127--129},
year={1950}
}
@inproceedings{arvo1990particle,
title={Particle transport and image synthesis},
author={Arvo, James and Kirk, David},
booktitle={Proceedings of the 17th annual conference on Computer graphics and interactive techniques},
url={https://www.cs.princeton.edu/courses/archive/fall04/cos526/papers/arvo90.pdf},
pages={63--66},
year={1990}
}
@article{tallec2017unbiasing,
title={Unbiasing truncated backpropagation through time},
author={Tallec, Corentin and Ollivier, Yann},
journal={arXiv preprint arXiv:1705.08209},
url={https://arxiv.org/pdf/1705.08209.pdf},
year={2017}
}
@article{beatson2019efficient,
title={Efficient optimization of loops and limits with randomized telescoping sums},
author={Beatson, Alex and Adams, Ryan P},
journal={Proceedings of the 36 th International Conference on Machine Learning},
url={https://arxiv.org/pdf/1905.07006.pdf},
year={2019}
}
@book{ulam1991adventures,
title={Adventures of a Mathematician},
author={Ulam, Stanislaw M},
year={1991},
url={https://www.ucpress.edu/book/9780520071544/adventures-of-a-mathematician},
publisher={University of California Press}
}
@incollection{hendricks1985mcnp,
title={MCNP variance reduction overview},
author={Hendricks, JS and Booth, TE},
booktitle={Monte-Carlo Methods and Applications in Neutronics, Photonics and Statistical Physics},
url={https://link.springer.com/chapter/10.1007/BFb0049037?noAccess=true},
pages={83--92},
year={1985},
publisher={Springer}
}
@incollection{carter1975particle,
title={Particle-transport simulation with the Monte Carlo method},
author={Carter, Leland Lavele and Cashwell, Edmond Darrell},
year={1975},
booktitle={Scientific Report},
url={https://www.osti.gov/biblio/4167844},
institution={Los Alamos Scientific Lab., N. Mex.(USA)}
}
@article{lu2012monte,
title={Monte carlo matrix inversion policy evaluation},
author={Lu, Fletcher and Schuurmans, Dale},
journal={UAI2003},
url={https://arxiv.org/pdf/1212.2471.pdf},
year={2012}
}
@article{liang2020general,
title={A general-purpose Monte Carlo particle transport code based on inverse transform sampling for radiotherapy dose calculation},
author={Liang, Ying and Muhammad, Wazir and Hart, Gregory R and Nartowt, Bradley J and Chen, Zhe J and Yu, James B and Roberts, Kenneth B and Duncan, James S and Deng, Jun},
journal={Scientific reports},
volume={10},
number={1},
pages={1--18},
year={2020},
publisher={Nature Publishing Group},
url={https://doi.org/10.1038/s41598-020-66844-7}
}
@article{lyne2015russian,
title={On Russian Roulette estimates for Bayesian inference with doubly-intractable likelihoods},
author={Lyne, Anne-Marie and Girolami, Mark and Atchadé, Yves and Strathmann, Heiko and Simpson, Daniel},
journal={Statistical science},
volume={30},
number={4},
pages={443--467},
year={2015},
publisher={Institute of Mathematical Statistics},
url={https://doi.org/10.1214/15-STS523}
}
@article{von1956probabilistic,
title={Probabilistic logics and the synthesis of reliable organisms from unreliable components},
author={Von Neumann, John},
journal={Automata studies},
volume={34},
number={34},
pages={43--98},
year={1956},
publisher={Princeton},
url={https://static.ias.edu/pitp/archive/2012files/Probabilistic_Logics.pdf}
}
@inproceedings{xu2019variational,
title={Variational russian roulette for deep bayesian nonparametrics},
author={Xu, Kai and Srivastava, Akash and Sutton, Charles},
booktitle={International Conference on Machine Learning},
url={http://proceedings.mlr.press/v97/xu19e/xu19e.pdf},
year={2019},
organization={PMLR}
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
document.addEventListener('DOMContentLoaded', doTOC, false);
</script>
<div id="TOC"></div>
<div style="display: none">
$$
\def\rr{\boldsymbol r}
\def\HH{\boldsymbol H}
\def\EE{\mathbb E}
\def\II{\boldsymbol I}
\def\CC{\boldsymbol C}
\def\DD{\boldsymbol D}
\def\KK{\boldsymbol K}
\def\eeps{\boldsymbol \varepsilon}
\def\tr{\text{tr}}
\def\LLambda{\boldsymbol \Lambda}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\qq{\boldsymbol q}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\lmax{L}
\def\lmin{\ell}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\QQ{\boldsymbol Q}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\DeclareMathOperator{\span}{\mathbf{span}}
\def\defas{\stackrel{\text{def}}{=}}
\def\dif{\mathop{}\!\mathrm{d}}
\definecolor{colorstochastic}{RGB}{27, 158, 119}
\definecolor{colorstepsize}{RGB}{217, 95, 2}
\DeclareMathOperator{\Proba}{Proba}
\def\Yhat{{\color{colorstochastic}\hat{Y}}}
\newcommand{\Var}{\operatorname{Var}}
\definecolor{color1}{RGB}{127,201,127}
\definecolor{color2}{RGB}{179,226,205}
\definecolor{color3}{RGB}{253,205,172}
\definecolor{color4}{RGB}{203,213,232}
$$
</div>
<h2>Von Neumann, Ulam and the Manhattan Project</h2>
<p>
The Russian Roulette estimator was invented by John von Neumann<dt-note><img style="display: block; margin: 0 auto; max-width: 200px; box-shadow: 6px 6px 3px grey;" src="/images/2021/JohnvonNeumann-LosAlamos.gif" alt=""> <br><a href="https://en.wikipedia.org/wiki/John_von_Neumann">John von Neumann</a> (1903-1957) was a Hungarian-American mathematician, physicist, computer scientist, engineer and polymath. He's widely regarded as the inventor of stochastic computing. Interestingly, his theory could not be implemented until 10 years later, with advances in computing.</dt-note> <dt-cite key="von1956probabilistic"></dt-cite> and Stanislaw Ulam<dt-note><img style="display: block; margin: 0 auto; max-width: 200px; box-shadow: 6px 6px 3px grey;" src="/images/2021/ulam.jpg" alt=""> <br> <a href="https://en.wikipedia.org/wiki/Stanislaw_Ulam">Stanisław Marcin Ulam</a> (1909 – 1984) was a Polish-American scientist in the fields of mathematics and nuclear physics. He participated in the Manhattan Project, originated the Teller-Ulam design of thermonuclear weapons, discovered the concept of the cellular automaton, invented the Monte Carlo method of computation, and suggested nuclear pulse propulsion. </dt-note> during the 1940s in the context of the Manhattan Project.<dt-note><img style="display: block; margin: 0 auto; max-width: 200px; box-shadow: 6px 6px 3px grey;" src="https://upload.wikimedia.org/wikipedia/commons/a/aa/Nagant_Revolver.jpg" alt="revolver"> <br><br> The name of this estimator comes from the <a href="https://en.wikipedia.org/wiki/Russian_roulette">Russian Roulette</a> game of chance, in which a player places a single round in a revolver and shoots at himself. As in the deadly game, the estimator decides through a game of chance whether to stop or <q>continue playing</q>. </dt-note> However, because of the secrecy of the project, they never published it.
</p>
<p>
The information we have about their discovery is second-hand, from colleagues who built upon their work and credited von Neumann and Ulam with the original discovery. One of the first applications of this technique was to design an unbiased estimator of the matrix pseudoinverse, which we'll discuss in <a href="#sec3">section 3</a>. This application was published by George Forsythe, and the paper starts by crediting von Neumann and Ulam for the method: <q>The following unusual method of inverting a class of matrices was devised by J. von Neumann and S. M. Ulam.</q><dt-cite key="forsythe1950matrix"></dt-cite>
</p>
<p>
Since Ulam and von Neumann's discovery, the method has found applications in fields as diverse as
computer graphics<dt-cite key="arvo1990particle"></dt-cite> and particle physics.<dt-cite key="carter1975particle"></dt-cite> <dt-cite key="hendricks1985mcnp"></dt-cite> <dt-cite key="liang2020general"></dt-cite>
In machine learning, it has found applications in the optimization of recurrent neural networks,<dt-cite key="tallec2017unbiasing"></dt-cite> Bayesian inference,<dt-cite key="lyne2015russian"></dt-cite> Bayesian nonparametrics,<dt-cite key="xu2019variational"></dt-cite> reinforcement learning,<dt-cite key="lu2012monte"></dt-cite> and implicit differentiation.<dt-cite key="beatson2019efficient"></dt-cite>
</p>
<h2>The Russian Roulette Estimator</h2>
<p>
Consider a sequence $\{Y_0, Y_1, \ldots\}$ with finite limit $Y_{\infty}$. How can we estimate the limit after having seen only a finite number of elements?
</p>
<p>
The Russian Roulette is an estimator that allows us to do precisely this: estimate the limit of a (potentially infinite) sequence using a finite amount of computation, thanks to the magic of randomness.
</p>
<p>
It works as follows. At the first iteration, the estimator takes the first element in the sequence. Then at each subsequent iteration, the algorithm draws a Bernoulli trial with probability of success $p$. If the trial is successful, the algorithm stops and returns the current estimate. If the trial is not successful, the algorithm continues and updates the current estimate $\Yhat$ with a new element of the sequence: $\Yhat + (1-p)^{-t}(Y_t - Y_{t-1})$.
</p>
<p class="framed">
<b class="tufte-underline">Russian Roulette Estimator</b><br>
<b>Input:</b> Probability $p \in [0, 1)$, initial estimate $Y_0$<br>
Set $\Yhat = Y_0$<br>
<b>For</b> $t=1, \ldots$ <b>do</b><br>
<span style="margin-left: 1em">With probability $p$: halt and <b>return</b>
$\Yhat$</span><br>
<span style="margin-left: 1em">Compute $Y_t$ and $\Yhat =
\Yhat + (1-p)^{-t}(Y_t - Y_{t-1})$
</span><br>
</p>
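<p>
Here is a sketch of this procedure in Python (names are mine, not code from the original post), applied to the scalar sequence $Y_t = 1 - 0.5^t$, whose limit is $1$:
</p>

```python
import random

def russian_roulette(seq, p, rng):
    """One sample of the Russian Roulette estimate of lim_t seq(t)."""
    y_hat = seq(0)                 # initial estimate Y_0
    y_prev, t = y_hat, 1
    while True:
        if rng.random() < p:       # halt with probability p
            return y_hat
        y_t = seq(t)               # otherwise fetch a new element ...
        y_hat += (1 - p) ** (-t) * (y_t - y_prev)   # ... and update
        y_prev, t = y_t, t + 1

rng = random.Random(0)
seq = lambda t: 1.0 - 0.5 ** t     # converges to 1 at rate O(0.5^t)
mean = sum(russian_roulette(seq, 0.3, rng) for _ in range(20_000)) / 20_000
print(mean)                        # the average over many samples is close to 1
```

<p>
A single sample can be far from the limit; it's only in expectation that the estimator matches it. The choice $p = 0.3$ is deliberately small: as discussed later in the post, too aggressive a halting probability makes the variance blow up.
</p>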
<p>
Unlike most algorithms, this one doesn't stop after a pre-determined number of iterations. Instead, the number of iterations the algorithm performs –which I'll refer to as the <i>stopping time</i>– is itself a random variable. Its distribution is that of the number of Bernoulli trials needed to get one success, which is by definition the <a href="https://en.wikipedia.org/wiki/Geometric_distribution">geometric distribution</a> with parameter $p$. </p>
<p>
Showing that $\Yhat$ is an unbiased estimator of the limit can be done by expanding the definition of $\Yhat$ and taking expectations. It's also interesting to note that the elements $Y_i$ of the sequence can be stochastic themselves, and the proof below goes through as long as the randomness in this sequence is independent of the halting time.
</p>
<p class="theorem" text="unbiasedness">
The Russian Roulette is an unbiased estimator of the limit of the sequence. That is, if $\EE$ denotes the expectation with respect to the randomness of the halting time, then we have $\EE[\Yhat] = Y_{\infty}$.
</p>
<p class="proof">
Let $T$ denote the total number of iterations performed by the estimator $\Yhat$ and ${\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}}$ denote a variable that is $1$ if $i \lt T$ and $0$ otherwise (throughout the blog post I'll use color to highlight quantities that are <i>random</i>). Using this notation, we can write the Russian roulette as the infinite sum
\begin{equation}\label{eq:russian_roulette_infinite}
\Yhat = Y_0 + {\textstyle\sum_{i=1}^{\infty}} {\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}}(1-p)^{-i} (Y_i - Y_{i-1}) \,,
\end{equation}
as all terms after $T$ will be zero. Since we know that $T$ follows a geometric distribution with parameter $p$, we can use this to compute $
\EE[{\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}}]=\Proba(i \lt T) = 1 - \Proba(T \leq i)$. This last term is the cumulative distribution function of the geometric distribution, which is $1 - (1-p)^i$, and so $\EE[{\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}}] = (1-p)^{i}$. Finally, taking expectations in the above formula and using this last fact we have the desired unbiasedness:
\begin{align}
\EE\big[\Yhat\big] &=\EE\left[Y_0 + {\textstyle\sum_{i=1}^{\infty}} {\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}}(1-p)^{-i} (Y_i - Y_{i-1}) \right]\\
&= Y_0 + {\textstyle\sum_{i=1}^{\infty}} \EE[{\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}}] (1-p)^{-i} (Y_i - Y_{i-1})\\
&=Y_0 + {\textstyle\sum_{i=1}^{\infty}} (Y_i - Y_{i-1}) = Y_{\infty}\,.
\end{align}
</p>
<h2>Application: Estimator of the Matrix Pseudoinverse</h2>
<p>
One of the first applications of the Russian Roulette was as an estimator of the pseudo-inverse $A^\dagger$ of a positive semi-definite matrix $A$.<dt-cite key="forsythe1950matrix"></dt-cite> This estimator is based on the <a href="https://en.wikipedia.org/wiki/Neumann_series">Neumann series</a> for the matrix pseudo-inverse:
\begin{equation}
A^{\dagger} = \gamma \sum_{i=0}^\infty (I - \gamma A)^i \,.
\end{equation}
This series is known to converge for any $\gamma \lt 2 / \lambda_{\max}$, where $\lambda_{\max}$ is $A$'s largest eigenvalue. We'll assume $\gamma$ is chosen such that the series is convergent.
</p>
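<p>
As a quick sanity check of this identity (a sketch of mine, not from the original post), we can compare the partial sums of the series against <code>numpy.linalg.pinv</code> on a small invertible positive-definite matrix, where the pseudoinverse coincides with the inverse:
</p>

```python
import numpy as np

# Partial sums of the Neumann series gamma * sum_i (I - gamma A)^i
# for a small invertible positive-definite A, where pinv(A) = inv(A).
A = np.diag([1.0, 0.5, 0.25])
gamma = 1.0                          # any gamma < 2 / lambda_max = 2 works
Y = np.zeros_like(A)
term = gamma * np.eye(3)             # first term: gamma * (I - gamma A)^0
for _ in range(200):
    Y += term
    term = term - gamma * A @ term   # next term: gamma * (I - gamma A)^{i+1}
print(np.abs(Y - np.linalg.pinv(A)).max())   # essentially zero
```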
<p>
We can then take as the items in the Russian Roulette sequence the partial sums of the above series, that is, $Y_t \defas \gamma \sum_{i=0}^t (I - \gamma A)^i $. This way, in light of the above identity, the pseudo-inverse is given by the limit $\lim_{t \to \infty}Y_t$. In this case, it will be more convenient to store the difference between two consecutive elements $\Delta_t \defas Y_{t} - Y_{t-1} = \gamma (I - \gamma A)^{t} = \Delta_{t-1} - \gamma A \Delta_{t-1}$, rather than the elements in the sequence $Y_0, Y_1, \ldots$. The resulting algorithm is:
</p>
<p class="framed">
<b class="tufte-underline">Matrix Pseudoinverse Estimator</b><br>
<b>Input:</b> Probability $p \in [0, 1)$ <br>
Set $\Yhat = \Delta_0 = \gamma I$ <br>
<b>For</b> $t=1, \ldots$ <b>do</b><br>
<span style="margin-left: 1em">With probability $p$: halt and <b>return</b>
$\Yhat$</span><br>
<span style="margin-left: 1em">Update $\Delta_{t} = \Delta_{t-1} - \gamma A \Delta_{t-1}$
</span><br>
<span style="margin-left: 1em">Update $\Yhat = \Yhat + (1-p)^{-t}\Delta_t$
</span><br>
</p>
<p>
If only an estimator of $A^\dagger b$ is needed for some vector $b$, the above algorithm can be adapted by replacing the initialization with $\Yhat = \Delta_0 = \gamma b$. The algorithm then doesn't need to store any matrices, only vectors.
</p>
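<p>
A minimal NumPy sketch of the estimator (variable names are mine, not from the original post): averaging many independent samples recovers $A^\dagger$, provided $p$ is small enough for the variance to remain bounded, a caveat discussed in the next section.
</p>

```python
import numpy as np

def pinv_estimate(A, gamma, p, rng):
    """One sample of the von Neumann-Ulam pseudoinverse estimator."""
    d = A.shape[0]
    delta = gamma * np.eye(d)          # Delta_0 = gamma I
    y_hat = delta.copy()               # Y_0 = gamma I
    t = 1
    while True:
        if rng.random() < p:           # halt with probability p
            return y_hat
        delta = delta - gamma * A @ delta   # Delta_t = (I - gamma A) Delta_{t-1}
        y_hat = y_hat + (1 - p) ** (-t) * delta
        t += 1

A = np.diag([1.0, 0.5])                # lambda_max = 1, lambda_min = 0.5
rng = np.random.default_rng(0)
samples = [pinv_estimate(A, 1.0, 0.2, rng) for _ in range(20_000)]
print(np.mean(samples, axis=0))        # close to diag(1, 2) = pinv(A)
```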
<h2>This is amazing! How come it's not more widely used?</h2>
<p>
Exactly my thought after learning about this estimator! However, after having used it in different problems, my initial enthusiasm soon vanished. The Russian Roulette comes with severe drawbacks.
</p>
<p>
While unbiasedness is a desirable property, it's not the only thing that matters. A controlled variance is another key ingredient of a good estimator. For example, an estimator with infinite variance has infinite mean squared error, and no law of large numbers applies. It's hard to see how such an estimator would be useful. And it turns out that the Russian Roulette, unless one is extremely careful, will lead to an estimator with infinite variance.
</p>
<p>
Let's then take a look at the variance of the Russian Roulette.<dt-note>I'll call variance the quantity $\EE[\|\Yhat - Y_{\infty}\|_F^2]$, where $\|X\|^2_F = \tr(X^\top X)$ is the Frobenius norm. This corresponds to the classical variance when $Y_i$ is a scalar, but is also well defined for matrix- and vector-valued estimators. Some authors refer to this as the <a href="https://en.wikipedia.org/wiki/Variance#For_vector-valued_random_variables">generalized variance</a>. </dt-note>
The following result gives an exact expression for this variance.
</p>
<p class="theorem" text="variance"><dt-note>I believe this result is new, or at least I haven't been able to find it in the literature. The closest I found is prior work discussing issues relative to the variance of this and other estimators, but without a simple formula like this one. If you disagree, please leave a comment!</dt-note>
The variance of the Russian Roulette estimate is
\begin{equation}\label{eq:variance}
\EE[\|\Yhat - Y_{\infty}\|_F^2] = \underbrace{\sum_{i=0}^{\infty}p\,\left({{1-p}}\right)^{-(i+1)}}_{\text{diverging}} \underbrace{\vphantom{\sum_{i=0}^{\infty}}\|Y_i - Y_{\infty}\|_F^2}_{\text{converging}} \,.
\end{equation}
</p>
<div class="wrap-collabsible"> <input id="collapsible3" class="toggle" type="checkbox"> <label for="collapsible3" class="lbl-toggle" tabindex="0"><b>Show proof</b></label><div class="collapsible-content"><div class="content-inner"><div class="proof" id="proof-variance">
<p>
The proof is rather straightforward once the error $Y_{\infty} - \Yhat$ is written as a sum of $R_i \defas Y_i - Y_{\infty}, R_{-1} = 0$ terms. Since $R_i - R_{i-1} = Y_i - Y_{i-1}$ we can write the Russian Roulette estimator from \eqref{eq:russian_roulette_infinite} as $\Yhat = Y_0 + \sum_{i=1}^\infty{\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}} (1-p)^{-i}(R_{i} - R_{i-1})$. Subtracting $Y_{\infty}$ and taking norms we then have
\begin{align}
\|Y_{\infty} - \Yhat\|^2
&= \|\sum_{i=0}^\infty{\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}} (1-p)^{-i}(R_{i} - R_{i-1})\|^2\\
&= \|\sum_{i=0}^{\infty} \underbrace{({\color{brown}\unicode{x1D7D9}_{\{i+1 \lt T\}}}(1-p)^{-(i+1)} - {\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}}(1-p)^{-i})}_{\defas {\color{olive}q_i} } R_i\|^2 \\
& = \sum_{i=0}^\infty {\color{olive}q_i}^2 \|R_i\|^2 + 2 \sum_{i=1}^{\infty} \sum_{j=i+1}^{\infty} {\color{olive}q_i} {\color{olive}q_j} \tr(R_i^\top R_j)\,.
\end{align}
</p>
<p>
The variance of our estimator is the expectation of the above expression, and so will depend on the expectations of ${\color{olive}q_i}^2$ and ${\color{olive}q_i} {\color{olive}q_j}$. Let's take a closer look at these quantities. The first one can be easily computed by expanding the square and noticing that ${\color{brown}\unicode{x1D7D9}_{\{i+1 \lt T\}}}{\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}} = {\color{brown}\unicode{x1D7D9}_{\{i+1 \lt T\}}}$:
\begin{align}
\EE[{\color{olive}q_i}^2] &= \EE[{\color{brown}\unicode{x1D7D9}_{\{i+1 \lt T\}}}](1-p)^{-2 (i+1)} + \EE[{\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}}](1-p)^{-2i} - 2 \overbrace{\EE[{\color{brown}\unicode{x1D7D9}_{\{i+1 \lt T\}}}{\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}}]}^{\EE[{\color{brown}\unicode{x1D7D9}_{\{i+1 \lt T\}}}] = (1-p)^{i+1}}(1-p)^{-2i - 1} \\
&= (1-p)^{-(i+1)} - (1-p)^{-i}\\
&= p (1-p)^{-(i+1)}\,.
\end{align}
If $i \lt j$ we have ${\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}}{\color{brown}\unicode{x1D7D9}_{\{j \lt T\}}} = {\color{brown}\unicode{x1D7D9}_{\{j \lt T\}}}$ and ${\color{brown}\unicode{x1D7D9}_{\{i +1 \lt T\}}}{\color{brown}\unicode{x1D7D9}_{\{j \lt T\}}} = {\color{brown}\unicode{x1D7D9}_{\{j \lt T\}}}$. We can use this to simplify the expectation of the cross product:
\begin{align}
\EE\left[{\color{olive}q_i} {\color{olive}q_j} \,|\, i\lt j\right] &= \EE ({\color{brown}\unicode{x1D7D9}_{\{j+1 \lt T\}}}(1-p)^{-(j+1)} - {\color{brown}\unicode{x1D7D9}_{\{j \lt T\}}}(1-p)^{-j})({\color{brown}\unicode{x1D7D9}_{\{i+1 \lt T\}}}(1-p)^{-(i+1)} - {\color{brown}\unicode{x1D7D9}_{\{i \lt T\}}}(1-p)^{-i})\\
&= \underbrace{\EE {\color{brown}\unicode{x1D7D9}_{\{j+1 \lt T\}}} (1-p)^{-(j+1)}}_{=1}\left[ (1-p)^{- (i+1)} - (1-p)^{ - i}\right] - \underbrace{\EE{\color{brown} \unicode{x1D7D9}_{\{j \lt T\}}}(1-p)^{-j}}_{=1}\left[ (1-p)^{-(i+1)} - (1-p)^{-i}\right]\\
&= 0\,.
\end{align}
Finally, taking the expectation of the squared error norm we have
\begin{align}
\EE \|Y_{\infty} - \Yhat\|^2
& = \sum_{i=0}^\infty \overbrace{\EE[{\color{olive}q_i}^2]}^{=p(1-p)^{-(i+1)}} \|R_i\|^2 + 2 \sum_{i=0}^{\infty} \sum_{j=i+1}^{\infty} \overbrace{\EE[{\color{olive}q_i} {\color{olive}q_j}]}^{=0} \tr(R_i^\top R_j)\\
&= \sum_{i=0}^\infty p(1-p)^{-(i+1)} \|R_i\|^2 \,,
\end{align}
and the Theorem statement follows by definition of $R_i$.
</p>
</div>
</div></div></div></div>
<p>
Let's unpack what this means. The variance expression above contains the infinite geometric series $\sum_{i=0}^{\infty}p\,\left({{1-p}}\right)^{-(i+1)}$. Since the ratio of this series, $(1-p)^{-1}$, is greater than $1$ for any $p > 0$, the series diverges. For the infinite sum (and so the variance) to be bounded, the terms $\|Y_{\infty} - Y_i\|^2_F$ multiplying it must compensate for this divergence. Although we've assumed that $Y_t$ converges to $Y_{\infty}$ as $t \to \infty$, so these terms asymptotically vanish, this is not enough for the variance to be finite. For the variance to be finite, it's necessary that the ratio of the series \eqref{eq:variance} is smaller than one. In other words, the terms $\|Y_{\infty} - Y_i\|^2_F$ need to converge at a speed <span class="tufte-underline"> faster than $\mathcal{O}((1-p)^t)$</span>.
</p>
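<p>
To make the variance expression concrete, here is a small numerical sketch in NumPy. It assumes the telescoping form of the Russian Roulette estimator used in the proof above, with $T$ geometric so that $P(T > i) = (1-p)^i$; the toy sequence, helper name and parameter values are mine, chosen for illustration:
</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def russian_roulette(Y, p, rng):
    """One sample of the Russian Roulette estimator of lim_i Y(i).

    T is geometric with P(T > i) = (1 - p)**i; each increment
    Y(i) - Y(i-1) is reweighted by 1 / P(T > i)."""
    T = rng.geometric(p)
    increments = [Y(0)] + [Y(i) - Y(i - 1) for i in range(1, T)]
    weights = (1 - p) ** -np.arange(T)
    return float(np.dot(weights, increments))

# toy scalar sequence converging to Y_inf = 1 at linear rate r
r, p = 0.5, 0.5          # finite variance requires r**2 < 1 - p
Y = lambda i: 1.0 - r ** (i + 1)

samples = np.array([russian_roulette(Y, p, rng) for _ in range(100_000)])
# theoretical variance: sum_i p (1-p)^{-(i+1)} |Y_inf - Y_i|^2
i = np.arange(100)
var_theory = np.sum(p * (1 - p) ** -(i + 1) * r ** (2 * (i + 1)))
print(samples.mean(), samples.var(), var_theory)  # ~1.0, ~0.5, 0.5
```

<p>
Re-running the same experiment with $p$ above the threshold $1 - r^2$ makes the empirical variance grow without bound as more samples are drawn, which is exactly the failure mode discussed next.
</p>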
<p>
That's a lot to ask. For example, in the case of the matrix pseudoinverse, it's known that the Neumann series with step-size $\gamma=\frac{1}{\lambda_{\max}}$ converges at the rate $\|Y_{\infty} - Y_t\|^2_F = \mathcal{O}\big((1 - \frac{\lambda_{\min}}{\lambda_{\max}})^{2t}\big)$. This leaves a very narrow set of acceptable values for $p$, namely $p \leq \frac{\lambda_{\min}}{\lambda_{\max}}(2 - \frac{\lambda_{\min}}{\lambda_{\max}})$.<dt-note>This bound could be improved to $p \leq \sqrt{\frac{\lambda_{\min}}{\lambda_{\max}}}(2 - \sqrt{\frac{\lambda_{\min}}{\lambda_{\max}}})$ by using the <a href="https://en.wikipedia.org/wiki/Approximation_theory#Chebyshev_approximation">Chebyshev approximation</a> to the inverse instead of the Neumann series. Although with a slightly better dependency, the average number of iterations $1/p$ will blow up as the inverse condition number $\frac{\lambda_{\min}}{\lambda_{\max}}$ vanishes.</dt-note> For problems with $\frac{\lambda_{\min}}{\lambda_{\max}} = 0.01$ (a relatively well-conditioned problem; practical problems often have an inverse condition number of $10^{-6}$), this gives $p \lt 0.0199$, which results in estimators with an average number of iterations $\geq 50$. And it only gets worse as the problem becomes more ill-conditioned!
</p>
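<p>
This back-of-the-envelope computation is easy to script. A minimal sketch (the function name is mine) that prints the largest admissible $p$ and the implied average number of iterations $1/p$ for a few inverse condition numbers:
</p>

```python
def max_admissible_p(inv_cond):
    """Largest p with finite variance for the Neumann-series
    pseudoinverse example: p <= kappa * (2 - kappa)."""
    return inv_cond * (2 - inv_cond)

for kappa in [0.1, 0.01, 1e-4, 1e-6]:
    p = max_admissible_p(kappa)
    print(f"lambda_min/lambda_max = {kappa:.0e}: "
          f"p <= {p:.2e}, average iterations 1/p >= {1 / p:.0f}")
```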
<figure>
<fieldset>
<legend>Take-home message</legend>
<p style="width: 100%">
The Russian Roulette will often result in useless estimators. Unless one is extremely careful and chooses fast-converging sequences or a large average number of iterations (small $p$), the variance of this estimator will be infinite.
</p>
</fieldset>
</figure>
<h2>Conclusion</h2>
<p>
The Russian Roulette is an unbiased estimator for the limit of a sequence. Its applications seem endless, as many hard problems can be cast as estimating the limit of a sequence. This includes –but is not limited to– inverting a matrix, minimizing a function, evaluating recurrent computational graphs, or performing Bayesian inference.
</p>
<p>
However, my initial enthusiasm soon turned into skepticism. From the variance analysis, we found out that the range of cases where the Russian Roulette leads to a useful estimator is very narrow.
</p>
<p>
This might explain the fact that –to the best of my knowledge– none of the methods based on the Russian Roulette described in the application section have become mainstream.
</p>
<p>
Finally, I'd like to mention that this is not the only unbiased estimator of the limit. There are others with potentially better properties, which I haven't discussed in this blog post. For example, (Beatson et al. 2019)<dt-cite key="beatson2019efficient"></dt-cite> discusses the alternative <q>single-sample</q> estimator.
</p>
<h3>Acknowledgements</h3>
<p>
Thanks to
<a href="https://scholar.google.co.uk/citations?user=hYtGXD0AAAAJ&hl=en">Charles Sutton</a>, <a href="https://prof-girolami.uk/">Mark Girolami</a>, <a href="https://scholar.google.com/citations?user=-tEiRFcAAAAJ&hl=en">James Harrison</a> and <a href="http://csml.stats.ox.ac.uk/people/lelan/">Charline Le Lan</a> for discussions around this topic and feedback on the blog post.
</p>
<h2>Citing</h2>
<p>
If you find this blog post useful, please consider citing as
</p>
<blockquote>
<a href="http://fa.bianp.net/blog/2022/russian-roulette/">The Russian Roulette: An Unbiased Estimator of the Limit</a>, Fabian Pedregosa, 2022
</blockquote>
<p>
with bibtex entry:
</p>
<pre>
<code>
@misc{pedregosa2022russian,
title={The Russian Roulette: An Unbiased Estimator of the Limit},
author={Fabian Pedregosa},
howpublished = {\url{http://fa.bianp.net/blog/2022/russian-roulette/}},
year={2022}
}
</code>
</pre>
<hr />
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
Notes on the Frank-Wolfe Algorithm, Part III: backtracking line-search. Fabian Pedregosa, 2022-08-26.
<p>
Backtracking step-size strategies (also known as adaptive step-size or approximate line-search) that set the step-size based on a sufficient decrease condition are the standard way to set the step-size on gradient descent and quasi-Newton methods. However, these techniques are much less common for <a href="/blog/2018/notes-on-the-frank-wolfe-algorithm-part-i/">Frank-Wolfe-like</a> algorithms. In this blog post I discuss a backtracking line-search for the Frank-Wolfe algorithm. <br><br>
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js", "color.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js">
</script>
<!-- the information icon -->
<link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Material+Symbols+Rounded:opsz,wght,FILL,GRAD@20..48,100..700,0..1,-50..200" />
<style>
.material-symbols-outlined {
font-variation-settings:
'FILL' 0,
'wght' 400,
'GRAD' 0,
'opsz' 48
}
</style>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@inproceedings{jaggi2013revisiting,
title={Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization.},
author={Jaggi, Martin},
journal={Proceedings of the 30th International Conference on Machine Learning},
year={2013},
url={http://proceedings.mlr.press/v28/jaggi13-supp.pdf}
}
@article{armijo1966minimization,
title={Minimization of functions having Lipschitz continuous first partial derivatives},
author={Armijo, Larry},
journal={Pacific Journal of mathematics},
year={1966},
publisher={Mathematical Sciences Publishers},
url={https://projecteuclid.org/euclid.pjm/1102995080}
}
@inproceedings{lacoste2013block,
title={Block-Coordinate Frank-Wolfe Optimization for Structural SVMs},
author={Lacoste-Julien, Simon and Jaggi, Martin and Schmidt, Mark and Pletscher, Patrick},
journal={International Conference on Machine Learning},
year={2013}
}
@inproceedings{bauschke2017convex,
title={Convex Analysis and Monotone Operator Theory in Hilbert Spaces},
author={Bauschke, Heinz and Combettes, Patrick},
journal={CMS Books in Mathematics},
year={2013},
url={https://doi.org/10.1007/978-3-319-48311-5}
}
@inproceedings{frank1956algorithm,
author = {Frank, Marguerite and Wolfe, Philip},
title = {An algorithm for quadratic programming},
journal = {Naval Research Logistics Quarterly},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109},
}
@inproceedings{dem1967minimization,
title={The minimization of a smooth convex functional on a convex set},
author={Demyanov, Vladimir Fedorovich and Rubinov, Alexander M},
journal={SIAM Journal on Control},
year={1967},
url={https://doi.org/10.1137/0305019}
}
@article{pedregosa2018step,
title={Linearly Convergent Frank-Wolfe with Backtracking Line-Search},
author={Pedregosa, Fabian and Negiar, Geoffrey and Askari, Armin and Jaggi, Martin},
journal={Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS)},
year={2020},
url={https://arxiv.org/pdf/1806.05123.pdf}
}
@article{pedregosa18adaptive,
title = {Adaptive Three Operator Splitting},
author = {Pedregosa, Fabian and Gidel, Gauthier},
journal = {Proceedings of the 35th International Conference on Machine Learning},
year = {2018},
pdf = {http://proceedings.mlr.press/v80/pedregosa18a/pedregosa18a.pdf},
url = {http://proceedings.mlr.press/v80/pedregosa18a.html},
}
@article{wright1999numerical,
title={Numerical optimization},
author={Wright, Stephen and Nocedal, Jorge},
journal={Springer Science},
pages={7},
year={1999},
url={https://www.springer.com/gp/book/9780387303031}
}
@article{dunn1980convergence,
title={Convergence rates for conditional gradient sequences generated by implicit step length rules},
author={Dunn, Joseph C},
journal={SIAM Journal on Control and Optimization},
year={1980},
publisher={SIAM},
url={https://doi.org/10.1137/0318035}
}
@article{rubinov1970approximate,
title={Approximate methods in optimization problems},
author={Demyanov, V. F. and Rubinov, A. M.},
year={1970},
journal={Elsevier},
url={https://doi.org/10.1002/zamm.19730530723}
}
@article{dunn1978conditional,
title={Conditional gradient algorithms with open loop step size rules},
author={Dunn, Joseph C and Harshbarger, S},
journal={Journal of Mathematical Analysis and Applications},
year={1978},
url={https://doi.org/10.1016/0022-247X(78)90137-3}
}
@inproceedings{locatello2017unified,
title={A Unified Optimization View on Generalized Matching Pursuit and Frank-Wolfe},
author={Locatello, Francesco and Khanna, Rajiv and Tschannen, Michael and Jaggi, Martin},
journal={International Conference on Artificial Intelligence and Statistics},
year={2017},
url={https://arxiv.org/pdf/1702.06457.pdf}
}
@book{nesterov2013introductory,
title={Introductory lectures on convex optimization: A basic course},
author={Nesterov, Yurii},
volume={87},
year={2004},
journal={Springer Science \& Business Media},
url={https://link.springer.com/book/10.1007/978-1-4419-8853-9}
}
@article{dem1968minimization,
title={Minimization of functionals in normed spaces},
author={Demyanov, Vladimir and Rubinov, Aleksandr},
journal={SIAM Journal on Control},
year={1968},
publisher={SIAM},
url={https://doi.org/10.1137/0306006}
}
@article{goldstein1965steepest,
title={On steepest descent},
author={Goldstein, Allen A},
journal={Journal of the Society for Industrial and Applied Mathematics, Series A: Control},
volume={3},
url={https://apps.dtic.mil/sti/pdfs/AD0613588.pdf},
number={1},
pages={147--151},
year={1965},
publisher={SIAM}
}
@article{wolfe1969convergence,
title={Convergence conditions for ascent methods},
author={Wolfe, Philip},
journal={SIAM review},
volume={11},
number={2},
pages={226--235},
year={1969},
publisher={SIAM},
url={https://www.jstor.org/stable/2028111}
}
@article{beck2015cyclic,
title={The cyclic block conditional gradient method for convex optimization problems},
author={Beck, Amir and Pauwels, Edouard and Sabach, Shoham},
journal={SIAM Journal on Optimization},
volume={25},
url={https://epubs.siam.org/doi/10.1137/15M1008397},
number={4},
pages={2024--2049},
year={2015},
publisher={SIAM}
}
@article{clarkson2010coresets,
title={Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm},
author={Clarkson, Kenneth L},
journal={ACM Transactions on Algorithms (TALG)},
year={2010},
publisher={ACM New York, NY, USA},
url={https://doi.org/10.1145/1824777.1824783}
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
document.addEventListener('DOMContentLoaded', doTOC, false);
</script>
<div id="TOC"></div>
<div style="display: none">
$$
\def\aa{\boldsymbol a}
\def\bb{\boldsymbol b}
\def\dd{\boldsymbol d}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\def\defas{\stackrel{\text{def}}{=}}
\definecolor{colorstepsize}{RGB}{215,48,39}
\def\stepsize{{\color{colorstepsize}{\boldsymbol{\gamma}}}}
\definecolor{colorLipschitz}{RGB}{27,158,119}
\def\Lipschitz{{\color{colorLipschitz}{\boldsymbol{L}}}}
\definecolor{colorLocalLipschitz}{RGB}{117,112,179}
\def\LocalLipschitz{{\color{colorLocalLipschitz}{\boldsymbol{M}}}}
$$
</div>
<p><br>
<span class="marginnote"><span class="material-symbols-rounded" style="font-size: 48px">
info
</span></span><i> This is the third post in a series on the Frank-Wolfe algorithm. See here for <a href="/blog/2018/notes-on-the-frank-wolfe-algorithm-part-i/">Part 1</a>, <a href="/blog/2018/fw2/">Part 2</a>.</i></p>
<h2>Introduction</h2>
<p>
The <a href="/blog/2018/notes-on-the-frank-wolfe-algorithm-part-i/">Frank-Wolfe (FW)</a> or conditional gradient algorithm is a method for constrained optimization that solves problems of the form
</p>
<p class="framed">\begin{equation}\label{eq:fw_objective}
\minimize_{\xx \in \mathcal{D}} f(\boldsymbol{x})~,
\end{equation}
</p>
<p>
where $f$ is a smooth function for which we have access to its gradient and $\mathcal{D}$ is a compact set. We also assume to have access to a <i>linear minimization oracle</i> over $\mathcal{D}$, that is, a routine that solves problems of the form
\begin{equation}
\label{eq:lmo}
\ss_t \in \argmax_{\boldsymbol{s} \in \mathcal{D}} \langle -\nabla f(\boldsymbol{x}_t), \boldsymbol{s}\rangle~.
\end{equation}
</p>
<p>From an initial guess $\xx_0 \in \mathcal{D}$, the FW algorithm generates a sequence of iterates $\xx_1, \xx_2, \ldots$ that converge towards the solution to \eqref{eq:fw_objective}.
</p>
<p class="framed">
<b class="tufte-underline">Frank-Wolfe algorithm</b><br>
<b>Input:</b> initial guess $\xx_0$, tolerance $\delta > 0$ <br>
$\textbf{For }t=0, 1, \ldots \textbf{ do } $<br>
$\boldsymbol{s}_t \in \argmax_{\boldsymbol{s} \in \mathcal{D}} \langle -\nabla f(\boldsymbol{x}_t), \boldsymbol{s}\rangle$ <br>
Set $\dd_t = \ss_t - \xx_t$ and $g_t = \langle - \nabla f(\xx_t), \dd_t\rangle$ <br>
Choose step-size ${\stepsize}_t$ (to be discussed later) <br>
$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t + {\stepsize}_t \dd_t~.$ <br>
$\textbf{end For loop}$ <br>
$\textbf{return } \xx_t$
</p>
<p>
Like other gradient-based methods, the FW algorithm depends on a step-size parameter ${\stepsize}_t$. Typical choices for this step-size are:
</p>
<p>
<b>1. Predefined decreasing sequence</b>. The simplest choice, developed in (Dunn and Harshbarger 1978)<dt-cite key="dunn1978conditional"></dt-cite> and more recently popularized in (Clarkson 2010, Jaggi 2013)<sup>,</sup><dt-cite key="clarkson2010coresets"></dt-cite><sup>,</sup><dt-cite key="jaggi2013revisiting"></dt-cite> is to choose the step-size according to the pre-defined decreasing sequence
\begin{equation}
{\stepsize}_t = \frac{2}{t+2}~.
\end{equation}
This choice of step-size is straightforward and cheap to compute. However, in practice it performs worse than the alternatives, although it enjoys the same worst-case complexity bounds.
</p>
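<p>
As an illustration, here is a minimal NumPy sketch of the algorithm with this step-size, applied to a least-squares problem over the unit $\ell_1$ ball, whose linear minimization oracle returns a signed coordinate vertex. This is a toy implementation rather than the reference one (see the <a href="https://github.com/openopt/copt">copt</a> package for that):
</p>

```python
import numpy as np

def lmo_l1(grad, alpha=1.0):
    """Linear minimization oracle over the l1 ball of radius alpha:
    the minimizer of <grad, s> is a signed vertex."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -alpha * np.sign(grad[i])
    return s

def frank_wolfe(grad_f, lmo, x0, max_iter=500):
    x = x0.copy()
    for t in range(max_iter):
        d = lmo(grad_f(x)) - x          # update direction d_t = s_t - x_t
        x = x + (2 / (t + 2)) * d       # predefined decreasing step-size
    return x

# least squares over the unit l1 ball: f(x) = 0.5 ||Ax - b||^2
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)

x = frank_wolfe(grad_f, lmo_l1, np.zeros(10))
print(np.abs(x).sum() <= 1 + 1e-9, f(x) < f(np.zeros(10)))  # True True
```

<p>
Note that since each iterate is a convex combination of points in $\mathcal{D}$, feasibility holds automatically, with no projection step.
</p>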
<p>
<b>2. Exact line-search</b>. Another alternative is to take the step-size that maximizes the decrease in objective along the update direction:
\begin{equation}\label{eq:exact_ls}
\stepsize_\star \in \argmin_{\stepsize \in [0, 1]} f(\xx_t + \stepsize \dd_t)~.
\end{equation}
By definition, this step-size gives the highest decrease per iteration. However, solving \eqref{eq:exact_ls} can be a costly optimization problem, so this variant is not practical except in a few specific cases where the above problem is easy to solve (for instance, quadratic objective functions).
</p>
<p>
<b>3. Demyanov-Rubinov step-size</b>. A less-known but highly effective step-size strategy for FW in the case in which we have access to the Lipschitz constant of $\nabla f$, denoted $\Lipschitz$,<dt-note>$\nabla f$ is $L$-Lipschitz if $\|\nabla f(\xx) - \nabla f(\yy)\| \leq \Lipschitz \|\xx - \yy\|$ for all $\xx, \yy$ in the domain.</dt-note> is the following:
\begin{equation}
\label{eq:ls_demyanov}
\stepsize = \min\left\{ \frac{g_t}{\Lipschitz\|\dd_t\|^2}, 1\right\}~.
\end{equation}
Note this step-size naturally goes to zero as we approach the optimum, which as we'll see in the next section is a desirable property. This is because the step-size is proportional to the Frank-Wolfe gap $g_t$, which is a measure of problem suboptimality.
This strategy was first published by Demyanov and Rubinov in the 1960s,<dt-note><img class="sideimage" src="/images/2022/demyanov_rubinov.png" alt=""> <br><a href="https://federation.edu.au/schools/school-of-engineering-information-technology-and-physical-sciences/research/computational-science-and-mathematics/ciao/alexander-rubinov">Alexander Rubinov (1940 – 2006)</a> (left) and <a href="https://dblp.org/pid/25/3977.html">Vladimir Demyanov</a> (1938–) (right) are two Russian pioneers of optimization. They wrote the optimization textbook <a href="https://www.google.com/books/edition/_/KEvvAAAAMAAJ?hl=en&gbpv=0&kptab=overview">Approximate Methods in Optimization Problems</a>, which contains a thorough discussion of the different step-size choices in the Frank-Wolfe algorithm. </dt-note> <dt-cite key="dem1968minimization"></dt-cite> <dt-cite key="rubinov1970approximate"></dt-cite> although surprisingly it seems to be less popular than the other approaches.
</p>
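<p>
For concreteness, here is a single Frank-Wolfe step with the Demyanov-Rubinov step-size on a toy least-squares problem over the unit $\ell_1$ ball. This is a sketch: it computes $\Lipschitz$ exactly as the squared spectral norm of $A$, which is rarely affordable in practice.
</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad f

x = np.zeros(10)
grad = A.T @ (A @ x - b)
s = np.zeros(10)                      # LMO over the unit l1 ball:
i = np.argmax(np.abs(grad))           # a signed coordinate vertex
s[i] = -np.sign(grad[i])
d = s - x
g = -grad @ d                         # Frank-Wolfe gap g_t
step = min(g / (L * (d @ d)), 1.0)    # Demyanov-Rubinov step-size
print(f(x + step * d) < f(x))         # guaranteed decrease: True
```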
<h2>Why not backtracking line-search?</h2>
<p>
Those familiar with methods based on gradient descent might be surprised by the omission of adaptive step-size methods (also known as backtracking line search) from the previous list. In <a href="https://en.wikipedia.org/wiki/Backtracking_line_search">backtracking line-search</a>, the step-size is selected based on a local condition.
Examples of these are the Armijo<dt-cite key="armijo1966minimization"></dt-cite> and Goldstein<dt-cite key="goldstein1965steepest"></dt-cite> (sometimes collectively referred to as Wolfe)<dt-cite key="wolfe1969convergence"></dt-cite> conditions.
These strategies have been wildly successful and are a core part of any state-of-the-art implementation of (proximal) gradient descent and quasi-Newton methods. Surprisingly, backtracking line-search has been almost absent from the literature on Frank-Wolfe.
</p>
<p>
There are important differences between the step-size of FW and gradient descent that can explain this disparity.
Consider the following figure. The left-hand side shows a toy 2-dimensional constrained problem, where the level curves represent the value of the objective function, the pentagon is the constraint set, and the orange and violet curves show the paths taken by Frank-Wolfe and gradient descent respectively as the step-size goes to zero. The right-hand side plot shows the optimal step-size (determined by exact line-search) at every point of the optimization path.
</p>
<p>
This last plot highlights two crucial differences between the Frank-Wolfe and Gradient Descent that we need to keep in mind when designing a step-size scheduler:
</p>
<figure class="fullwidth" style="background-color: #fffff8;">
<!-- <span class="marginnote">bla bla bla bla</span> -->
<img width="90%" src="/images/2022/fw_motivation.png" alt="Comparison of step-sizes between FW and gradient descent">
</figure>
<p>
<span class="marginnote">
<a href="https://colab.research.google.com/gist/fabianp/e9ae8c63d1bf27a2ec005f62cb5e6fa1/adaptive_frank_wolfe.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</span>
</p>
<p>
<ul>
<!-- <li><b>Decreasing vs constant vs step-size.</b> In the FW algorithm and contrary to gradient-based methods, the update direction $\dd_t$ does not necessary </li> -->
<li><b>Stability and convergence to zero.</b> Gradient Descent converges with a constant (non-zero) step-size, Frank-Wolfe doesn't. It requires a decreasing step-size to anneal the (potentially non-zero) magnitude of the update.</li>
<li><b>Zig-zagging</b>. When close to the optimum, the Frank-Wolfe algorithm often exhibits a zig-zagging behavior, in which the selected vertex $\ss_t$ oscillates between two vertices. The best step-size might be different along these two directions, so a good strategy should be able to quickly alternate between two different step-sizes.</li>
</ul>
</p>
<h2>Dissecting the Demyanov-Rubinov step-size</h2>
<p>
Understanding the Demyanov-Rubinov step-size will be crucial to developing an effective adaptive step-size in the next section.
</p>
<p>
The Demyanov-Rubinov step-size can be derived from a quadratic upper bound on the objective.
It's a classical result in optimization that a function with $\Lipschitz$-Lipschitz gradient admits the following quadratic upper bound for all $\xx, \yy$ in the domain:<dt-cite key="nesterov2013introductory"></dt-cite>
\begin{equation}\label{eq:l_lsmooth}
f(\yy) \leq f(\xx) + \langle \nabla f(\xx), \yy - \xx \rangle + \frac{\Lipschitz}{2}\|\xx - \yy\|^2~.
\end{equation}
Applying this inequality at the current and next FW iterate $(\xx = \xx_t, \yy = \xx_{t} + \stepsize \dd_t)$ we obtain
\begin{equation}\label{eq:l_smooth2}
f(\xx_t + \stepsize \dd_t) \leq f(\xx_t) - \stepsize g_t + \frac{\stepsize^2 \Lipschitz}{2}\|\dd_t\|^2~.
\end{equation}
</p>
<p>
The right hand side is a quadratic function of $\stepsize$. Minimizing it subject to the constraint $\stepsize \in [0, 1]$ –to guarantee the iterates remain in the domain– gives the following solution:<dt-note>Exact line-search would correspond to minimizing the left hand side. We're relaxing the exact line-search problem and minimizing an upper bound on our true objective. </dt-note>
\begin{equation}
\stepsize_t^\text{DR} = \min\left\{ \frac{g_t}{\Lipschitz\|\dd_t\|^2}, 1\right\}\,,
\end{equation}
which is the DR step-size from Eq. \eqref{eq:ls_demyanov}.
</p>
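<p>
As a sanity check, we can let SymPy minimize the right hand side of \eqref{eq:l_smooth2} symbolically (ignoring the clipping to $[0, 1]$, which the algorithm applies afterwards):
</p>

```python
import sympy as sp

# symbols: step-size, FW gap g_t, Lipschitz constant, and ||d_t||^2
gamma, g, L, d2 = sp.symbols('gamma g L d2', positive=True)
Q = -gamma * g + sp.Rational(1, 2) * gamma**2 * L * d2  # upper bound minus f(x_t)
gamma_star = sp.solve(sp.Eq(sp.diff(Q, gamma), 0), gamma)[0]
print(gamma_star)  # g/(L*d2), the unclipped Demyanov-Rubinov step-size
```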
<h2>Frank-Wolfe with backtracking line-search</h2>
<p>
The main drawback of the DR step-size is that it requires knowledge of the Lipschitz constant. This limits its usefulness because, first, this constant might be costly to compute, and second, it is a <i>global</i> upper bound on the curvature, leading to suboptimal step-sizes in regions where the local curvature is much smaller.
</p>
<p>
But there's a way around this limitation. It's possible to <i>estimate</i> a step-size that guarantees a certain decrease of the objective without using global constants. The first work to attempt this was (Dunn 1980)<dt-cite key="dunn1980convergence"></dt-cite>, who developed an analysis for the Goldstein-Armijo line-search. However, the algorithm we'll present here is based on a different backtracking line-search that is better adapted to the Frank-Wolfe algorithm. It first appeared, to the best of my knowledge, in (Beck et al. 2015)<dt-cite key="beck2015cyclic"></dt-cite> and was later refined and generalized to many other Frank-Wolfe variants by myself and coauthors.<dt-cite key="pedregosa2018step"></dt-cite>
</p>
<p>
In this algorithm, instead of relying on the quadratic upper bound given by the Lipschitz constant, we construct a local quadratic approximation at $\xx_t$. This quadratic function is analogous to \eqref{eq:l_smooth2}, but where the global Lipschitz constant is replaced by the potentially much smaller local constant $\LocalLipschitz_t$:
\begin{equation}
Q_t(\stepsize, \LocalLipschitz_t) \defas f(\xx_t) - \stepsize g_t + \frac{\stepsize^2 \LocalLipschitz_t}{2} \|\dd_t\|^2\,.
\end{equation}
As with the Demyanov-Rubinov step-size, we will choose the step-size that minimizes the approximation on the $[0, 1]$ interval, that is
\begin{equation}
\stepsize_{t} = \min\left\{ \frac{g_t}{\LocalLipschitz_t\|\dd_t\|^2}, 1\right\}\,.
\end{equation}
Note that this step-size has a very similar form as the DR step-size, but with $\LocalLipschitz_t$ replacing $\Lipschitz$.
</p>
<p>
This is all well and good, but so far it seems we've only displaced the problem from estimating $\Lipschitz$ to estimating $\LocalLipschitz_t$.
Here is where things turn interesting: <u>there's a way to estimate $\LocalLipschitz_t$, that is both convenient and also gives strong theoretical guarantees.</u>
</p>
<p><span class="marginnote"><a href="/images/2020/fw_upper_bound.png"><img style="margin-top: 20px; width: 90%; display: block; margin: 0 auto; max-width: 250px; box-shadow: 6px 6px 3px grey;" src="/images/2022/fw_upper_bound.png"></a><br>The sufficient decrease condition ensures that the quadratic approximation is an upper bound at its constrained minimum of the line-search objective.</span>
For this, we'll choose the $\LocalLipschitz_t$ that makes the quadratic approximation an upper bound at the next iterate:
\begin{equation}\label{eq:sufficient_decrease}
f(\xx_{t+1}) \leq Q_t(\stepsize_t, \LocalLipschitz_t) \,.
\end{equation}
This way, we can guarantee that the backtracking line-search step-size makes at least as much progress as exact line-search in the quadratic approximation.
</p>
<p>
There are many values of $\LocalLipschitz_t$ that verify this condition. For example, from the $\Lipschitz$-smooth inequality \eqref{eq:l_smooth2} we know that any $\LocalLipschitz_t \geq \Lipschitz$ will be a valid choice. However, values of $\LocalLipschitz_t$ that are much smaller than $\Lipschitz$ will be the most interesting, as these will lead to larger step-sizes.
</p>
<p>
In practice there's little value in spending too much time finding the smallest possible value of $\LocalLipschitz_t$; the most common strategy consists of initializing it to a value slightly smaller than the one used in the previous iterate (for example $0.99 \times \LocalLipschitz_{t-1}$) and correcting if necessary.
</p>
<p>
Below is the full Frank-Wolfe algorithm with backtracking line-search. The parts responsible for the estimation of the step-size are between the comments <span style="color: gray"><i> /* begin of backtracking line-search */</i></span> and <span style="color: gray"><i> /* end of backtracking line-search */ </i></span>.
</p>
<p class="framed">
<b class="tufte-underline">Frank-Wolfe with backtracking line-search</b><br>
<b>Input:</b> initial guess $\xx_0$, tolerance $\delta > 0$, backtracking line-search parameters $\tau > 1$, $\eta \leq 1$, initial guess for $\LocalLipschitz_{-1}$.
<dt-note>Default values for the backtracking parameters that work well in my experience: $\tau = 2.0$ and $\eta = 0.9$. See below for a heuristic on $\LocalLipschitz_{-1}$. </dt-note><br>
$\textbf{For }t=0, 1, \ldots \textbf{ do } $
$\quad\boldsymbol{s}_t \in \argmax_{\boldsymbol{s} \in \mathcal{D}} \langle -\nabla f(\boldsymbol{x}_t), \boldsymbol{s}\rangle$ <br>
Set $\dd_t = \ss_t - \xx_t$ and $g_t = \langle - \nabla f(\xx_t), \dd_t\rangle$ <br>
<span style="color: gray"><i> /* begin of backtracking line-search */</i></span> <br>
$\LocalLipschitz_t = \eta \LocalLipschitz_{t-1}$ <br>
$\stepsize_t = \min\left\{{{g}_t}/{(\LocalLipschitz_t\|\dd_t\|^{2})}, 1\right\}$<br>
<b>While</b> $f(\xx_t + \stepsize_t \dd_t) > Q_t(\stepsize_t, \LocalLipschitz_t) $ <b>do </b> <dt-note>Increase $\LocalLipschitz_t$ until it satisfies the sufficient decrease condition \eqref{eq:sufficient_decrease}.</dt-note><br>
$\LocalLipschitz_t = \tau \LocalLipschitz_t$ <br>
<span style="color: gray"><i> /* end of backtracking line-search */ </i></span><br>
$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t + {\stepsize}_t \dd_t~.$ <br>
$\textbf{end For loop}$ <br>
$\textbf{return } \xx_t$
</p>
<p>
Let's unpack what happens inside the backtracking line-search block. The block starts by choosing a constant that is a factor $\eta$ smaller than the one given by the previous iterate. If $\eta = 1$, it will be exactly the same as the previous one, but if $\eta$ is smaller than $1$ (I found $\eta = 0.9$ to be a reasonable default), this will result in a candidate value of $\LocalLipschitz_t$ that is smaller than the one used in the previous iterate. This ensures the constant can decrease if we move to a region with smaller curvature.
</p>
<p>
In the next line the algorithm sets a tentative value for the step-size based on the formula $\stepsize_t = \min\{g_t/(\LocalLipschitz_t\|\dd_t\|^2), 1\}$, using the current (tentative) value for $\LocalLipschitz_t$. The next line is a While loop that increases $\LocalLipschitz_t$ until it verifies the sufficient decrease condition \eqref{eq:sufficient_decrease}.
</p>
<p>
The algorithm is not <i>fully</i> agnostic to the <q>local</q> Lipschitz constant, as it still requires setting an initial value for this constant, $\LocalLipschitz_{-1}$. One heuristic that I found works well in practice is to initialize it to the (approximate) local curvature along the update direction. For this, select a small $\varepsilon$, say $\varepsilon = 10^{-3}$, and set
\begin{equation}
\LocalLipschitz_{-1} = \frac{\|\nabla f(\xx_0) - \nabla f(\xx_0 + \varepsilon \dd_0)\|}{\varepsilon \|\dd_0\|}\,.
\end{equation}
</p>
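<p>
Putting the pieces together, here is a compact NumPy sketch of the algorithm above on a toy least-squares problem over the unit $\ell_1$ ball, including the initialization heuristic for $\LocalLipschitz_{-1}$. Again, this is an illustration rather than the reference implementation (the <a href="https://github.com/openopt/copt">copt</a> package contains the latter):
</p>

```python
import numpy as np

def backtracking_step(f, x, d, g, M, tau=2.0, eta=0.9):
    """Sufficient-decrease backtracking: returns (step, M_t)."""
    M = eta * M                                   # tentative, slightly optimistic value
    while True:
        step = min(g / (M * (d @ d)), 1.0)
        Q = f(x) - step * g + 0.5 * step**2 * M * (d @ d)
        if f(x + step * d) <= Q + 1e-12:          # sufficient decrease holds
            return step, M
        M = tau * M                               # otherwise increase M_t

def fw_backtracking(f, grad_f, lmo, x0, max_iter=200, tol=1e-10):
    x = x0.copy()
    d0 = lmo(grad_f(x)) - x                       # assumes x0 is not already a solution
    eps = 1e-3                                    # heuristic initialization of M_{-1}
    M = np.linalg.norm(grad_f(x) - grad_f(x + eps * d0)) / (eps * np.linalg.norm(d0))
    for _ in range(max_iter):
        grad = grad_f(x)
        d = lmo(grad) - x
        g = -grad @ d                             # Frank-Wolfe gap
        if g < tol:
            break
        step, M = backtracking_step(f, x, d, g, M)
        x = x + step * d
    return x

def lmo_l1(grad):
    """LMO over the unit l1 ball: a signed coordinate vertex."""
    s = np.zeros_like(grad)
    i = np.argmax(np.abs(grad))
    s[i] = -np.sign(grad[i])
    return s

# least squares over the unit l1 ball: f(x) = 0.5 ||Ax - b||^2
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)
x = fw_backtracking(f, grad_f, lmo_l1, np.zeros(10))
print(np.abs(x).sum() <= 1 + 1e-9)  # feasible: True
```

<p>
For a quadratic objective the While loop terminates as soon as $\LocalLipschitz_t$ exceeds the curvature along $\dd_t$, typically after zero or one doubling.
</p>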
<h2>Convergence rates</h2>
<p>
By the sufficient decrease condition we have at each iteration
\begin{align}
f(\xx_{t+1}) &\leq f(\xx_t) - \stepsize_t g_t +
\frac{\stepsize_t^2 \LocalLipschitz_t}{2}\|\ss_t - \xx_t\|^2 \\
&\leq f(\xx_t) - \xi_t g_t +
\frac{\xi_t^2 \LocalLipschitz_t}{2}\|\ss_t - \xx_t\|^2 \text{ for any $\xi_t \in [0, 1]$} \\
&\leq f(\xx_t) - \xi_t g_t +
\frac{\xi_t^2 \tau \Lipschitz}{2}\|\ss_t - \xx_t\|^2 \text{ for any $\xi_t \in [0, 1]$}\,,
\end{align}
where the second inequality uses the fact that $\stepsize_t$ minimizes the right-hand side over the $[0, 1]$ interval, so its value there lower bounds the value at any other $\xi_t \in [0, 1]$. The third inequality uses that the sufficient decrease condition is verified for any $\LocalLipschitz_t \geq \Lipschitz$, and so the backtracking loop cannot bring its value above $\tau \Lipschitz$, as long as it was initialized with a value below this constant.
</p>
<p>
The last inequality is the same one we used as a starting point in Theorem 3 of <a href="/blog/2018/fw2/">the second part of these notes</a> to show the sublinear convergence for convex objectives (with $\Lipschitz$ instead of $\tau \Lipschitz$). Following the same proof then yields a $\mathcal{O}(\frac{1}{t})$ convergence rate on the primal-dual gap, with $\Lipschitz$ replaced by $\tau \Lipschitz$.
</p>
<p>
Our paper<dt-cite key="pedregosa2018step"></dt-cite> contains the full proof, including a proof for non-convex objectives, as well as extensions to many other Frank-Wolfe variants such as Pairwise and Away-steps.
</p>
<h2>Benchmarks</h2>
<p>
The empirical speedup of the backtracking line-search is huge, sometimes up to an order of magnitude. Below I compare Frank-Wolfe with backtracking line-search (denoted <q>adaptive</q>) and with Demyanov-Rubinov step-size (denoted <q>Lipschitz</q>). All the problems are instances of logistic regression with $\ell_1$ regularization on different datasets using the <a href="https://github.com/openopt/copt">copt</a> software package.
</p>
<figure class="">
<span class="marginnote">Comparison between Frank-Wolfe with backtracking (denoted <q>adaptive</q>) and Demyanov-Rubinov (denoted <q>Lipschitz</q>) on different datasets. <br><br><a href="https://openopt.github.io/copt/auto_examples/frank_wolfe/plot_sparse_benchmark.html#sphx-glr-auto-examples-frank-wolfe-plot-sparse-benchmark-py">Source code</a> </span>
<ul style="list-style: none; display: block; width: 100%">
<li style="display: inline; line-height: 1.4em;"><img style="max-width: 42%; display: inline; margin-top: 0px; height: auto !important; border: 0" alt="https://openopt.github.io/copt/auto_examples/frank_wolfe/../../_images/sphx_glr_plot_sparse_benchmark_001.png" class="sphx-glr-multi-img" src="https://openopt.github.io/copt/auto_examples/frank_wolfe/../../_images/sphx_glr_plot_sparse_benchmark_001.png">
</li>
<li style="display: inline; line-height: 1.4em;"><img style="max-width: 42%; display: inline; margin-top: 0px; height: auto !important; border: 0" alt="https://openopt.github.io/copt/auto_examples/frank_wolfe/../../_images/sphx_glr_plot_sparse_benchmark_002.png" class="sphx-glr-multi-img" src="https://openopt.github.io/copt/auto_examples/frank_wolfe/../../_images/sphx_glr_plot_sparse_benchmark_002.png">
</li>
<li style="display: inline; line-height: 1.4em;"><img style="max-width: 42%; display: inline; margin-top: 0px; height: auto !important; border: 0" alt="https://openopt.github.io/copt/auto_examples/frank_wolfe/../../_images/sphx_glr_plot_sparse_benchmark_003.png" class="sphx-glr-multi-img" src="https://openopt.github.io/copt/auto_examples/frank_wolfe/../../_images/sphx_glr_plot_sparse_benchmark_003.png">
</li>
<li style="display: inline; line-height: 1.4em;"><img style="max-width: 42%; display: inline; margin-top: 0px; height: auto !important; border: 0" alt="https://openopt.github.io/copt/auto_examples/frank_wolfe/../../_images/sphx_glr_plot_sparse_benchmark_004.png" class="sphx-glr-multi-img" src="https://openopt.github.io/copt/auto_examples/frank_wolfe/../../_images/sphx_glr_plot_sparse_benchmark_004.png">
</li>
</ul>
<!-- <img src="https://openopt.github.io/copt/_images/sphx_glr_plot_sparse_benchmark_001.png" alt=""> -->
</figure>
<h2>Citing</h2>
<p>
If you've found this blog post useful, please consider citing its full-length companion paper. The paper extends the theory presented here to other Frank-Wolfe variants such as Away-steps Frank-Wolfe and Pairwise Frank-Wolfe. Furthermore, it shows that these variants maintain their linear convergence when combined with backtracking line-search.
</p>
<blockquote>
<a href="https://arxiv.org/pdf/1806.05123.pdf">
Linearly Convergent Frank-Wolfe with Backtracking Line-Search</a>, Fabian Pedregosa, Geoffrey Negiar, Armin Askari, Martin Jaggi. <i>Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS)</i>, 2020
</blockquote>
<p>
Bibtex entry:
</p>
<pre>
<code>
@inproceedings{pedregosa2020linearly,
title={Linearly Convergent Frank-Wolfe with Backtracking Line-Search},
author={Pedregosa, Fabian and Negiar, Geoffrey and Askari, Armin and Jaggi, Martin},
booktitle={International Conference on Artificial Intelligence and Statistics},
series = {Proceedings of Machine Learning Research},
year={2020}
}
</code>
</pre>
<h3>Thanks</h3>
<p>
Thanks <a href="http://geoffreynegiar.com/">Geoffrey Negiar</a> and <a href="https://q-berthet.github.io/">Quentin Berthet</a> for feedback on this post.
</p>
<hr />
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
<h1>On the Link Between Optimization and Polynomials, Part 5</h1>
<p>2022-05-27 · <a href='https://scholar.google.com/citations?user=93PAG2AAAAAJ&hl=en'>Baptiste Goujaud</a> and <a href='http://fa.bianp.net/pages/about.html'>Fabian Pedregosa</a></p>
<blockquote class="pullquote" style="margin-left: 20px">
<p>
<br><i>
<b style="font-style: normal;">Six</b>: All of this has happened before. <br>
<b style="font-style: normal;">Baltar</b>: But the question remains, does all of this have to happen again?<br>
<b style="font-style: normal;">Six</b>: This time I bet no.<br>
<b style="font-style: normal;">Baltar</b>: You know, I've never known you to play the optimist. Why the change of heart?<br>
<b style="font-style: normal;">Six</b>: Mathematics. Law of averages. Let a complex system repeat itself long enough and eventually something surprising might occur. That, too, is in God's plan.
</i><br>
</p>
<p style="text-align: right;">
Battlestar Galactica <a href="https://en.wikiquote.org/wiki/Battlestar_Galactica_(2003)#Daybreak_Part_2_[4.22]">*</a>
</p>
</blockquote>
<p>
Momentum with cyclical step-sizes has been shown to sometimes accelerate convergence, but why? We'll take a closer look at this technique, and with the help of polynomials unravel some of its mysteries.
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js", "color.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@article{Rutishauser1959,
author="Rutishauser, H.",
title="Theory of Gradient Methods",
journal="Refined Iterative Methods for Computation of the Solution and the Eigenvalues of Self-Adjoint Boundary Value Problems",
year="1959",
url="https://doi.org/10.1007/978-3-0348-7224-9_2"
}
@book{fischer1996polynomial,
title={Polynomial based iteration methods for symmetric linear systems},
author={Fischer, Bernd},
year={1996},
url={https://doi.org/10.1007/978-3-663-11108-5},
publisher={Springer}
}
@article{hestenes1952methods,
title={Methods of conjugate gradients for solving linear systems},
author={Hestenes, Magnus and Stiefel, Eduard},
journal={Journal of research of the National Bureau of Standards},
year={1952},
url={https://pdfs.semanticscholar.org/466d/addfb6340c28cb8da548007028c8cc5df687.pdf}
}
@article{pedregosa2020average,
title={Average-case Acceleration Through Spectral Density Estimation},
author={Pedregosa, Fabian and Scieur, Damien},
journal={arXiv preprint arXiv:2002.04756},
year={2020},
url={https://arxiv.org/pdf/2002.04756.pdf}
}
@article{scieur2020universal,
title={Universal Average-Case Optimality of Polyak Momentum},
author={Scieur, Damien and Pedregosa, Fabian},
journal={arXiv preprint arXiv:2002.04664},
year={2020},
url={https://arxiv.org/pdf/2002.04664.pdf}
}
@article{polyak1964some,
title={Some methods of speeding up the convergence of iteration methods},
author={Polyak, Boris T},
journal={USSR Computational Mathematics and Mathematical Physics},
year={1964},
url={https://doi.org/10.1016/0041-5553(64)90137-5}
}
@article{polyak1987introduction,
title={Introduction to Optimization},
author={Polyak, Boris T},
journal={Optimization Software, Inc. Publications Division, New York},
url={https://b-ok.cc/book/2461679/c8b7e4},
year={1987}
}
@article{frankel1950convergence,
title={Convergence rates of iterative treatments of partial differential equations},
author={Frankel, Stanley},
journal={Mathematical Tables and Other Aids to Computation},
year={1950},
url={https://www.jstor.org/stable/2002770},
publisher={JSTOR}
}
@inproceedings{sutskever2013importance,
title={On the importance of initialization and momentum in deep learning},
author={Sutskever, Ilya and Martens, James and Dahl, George and Hinton, Geoffrey},
booktitle={International Conference on Machine Learning},
year={2013},
url={http://proceedings.mlr.press/v28/sutskever13.pdf}
}
@book{elaydi2005introduction,
title={An introduction to difference equations},
author={Elaydi, Saber},
year={2005},
publisher={Springer}
}
@article{perron1921summengleichungen,
title={{\"U}ber summengleichungen und Poincarésche differenzengleichungen},
author={Perron, Oskar},
journal={Mathematische Annalen},
year={1921},
publisher={Springer}
}
@article{hochstrasser1954anwendung,
title={Die Anwendung der Methode der konjugierten Gradienten und ihrer Modifikationen auf die Lösung linearer Randwertprobleme},
author={Hochstrasser, Urs},
year={1954},
journal={Doctoral Thesis (in German), ETH Zurich},
url={https://doi.org/10.3929/ethz-a-000091966}
}
@inproceedings{ghadimi2015global,
title={Global convergence of the heavy-ball method for convex optimization},
author={Ghadimi, Euhanna and Feyzmahdavian, Hamid Reza and Johansson, Mikael},
booktitle={2015 European Control Conference (ECC)},
year={2015},
url={https://arxiv.org/abs/1412.7457},
}
@inproceedings{flammarion2015averaging,
title={From averaging to acceleration, there is only a step-size},
author={Flammarion, Nicolas and Bach, Francis},
booktitle={Conference on Learning Theory},
year={2015}
}
@article{totik2005orthogonal,
title={Orthogonal polynomials},
author={Totik, Vilmos},
journal={arXiv preprint math/0512424},
year={2005},
url={https://arxiv.org/pdf/math/0512424.pdf}
}
@book{suli2003introduction,
title={An introduction to numerical analysis},
author={Süli, Endre and Mayers, David},
year={2003},
publisher={Cambridge University Press},
url={https://www.cambridge.org/core/books/an-introduction-to-numerical-analysis/FD8BCAD7FE68002E2179DFF68B8B7237}
}
@book{szeg1975orthogonal,
title={Orthogonal polynomials},
author={Szegő, Gábor},
year={1975},
publisher={American Mathematical Soc.},
url={https://people.math.osu.edu/nevai.1/SZEGO/szego=szego1975=ops=OCR.pdf}
}
@book{gautschi2004orthogonal,
title={Orthogonal polynomials},
author={Gautschi, Walter},
year={2004},
publisher={Oxford University Press},
url={https://global.oup.com/academic/product/orthogonal-polynomials-9780198506720?cc=ca&lang=en&}
}
@article{marcellan2001favard,
title={On the “Favard theorem” and its extensions},
author={Marcellán, Francisco and Álvarez-Nodarse, Renato},
journal={Journal of computational and applied mathematics},
year={2001},
url={http://merlin.us.es/~renato/papers/fav-jcam.pdf},
publisher={Elsevier}
}
@article{favard1935,
author = {Jean Favard},
title = {Sur les polynomes de Tchebicheff},
journal = {Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences, Paris},
url={https://gallica.bnf.fr/ark:/12148/bpt6k3152t/f2052.item},
volume = {200},
year = {1935},
publisher = {Gauthier-Villars, Paris},
}
@book{mhaskar1997introduction,
title={Introduction to the theory of weighted polynomial approximation},
author={Mhaskar, Hrushikesh Narhar},
volume={7},
year={1997},
publisher={World Scientific}
}
@book{grenander1958toeplitz,
title={Toeplitz forms and their applications},
author={Grenander, Ulf and Szegő , Gábor},
year={1958},
publisher={Univ of California Press}
}
@article{van1991orthogonal,
title={Orthogonal polynomials, associated polynomials and functions of the second kind},
author={Van Assche, Walter},
journal={Journal of computational and applied mathematics},
volume={37},
number={1-3},
pages={237--249},
year={1991},
publisher={Elsevier},
url={https://doi.org/10.1016/0377-0427(91)90121-Y}
}
@article{zhang2017yellowfin,
title={Yellowfin and the art of momentum tuning},
author={Zhang, Jian and Mitliagkas, Ioannis},
journal={SysML},
year={2018},
url={https://arxiv.org/pdf/1706.03471.pdf}
}
@article{paquette2020halting,
title={Halting Time is Predictable for Large Models: A Universality Property and Average-case Analysis},
author={Paquette, Courtney and van Merriënboer, Bart and Paquette, Elliot and Pedregosa, Fabian},
journal={arXiv preprint arXiv:2006.04299},
url={https://arxiv.org/pdf/2006.04299.pdf},
year={2020}
}
@article{goh2017momentum,
title={Why momentum really works},
author={Goh, Gabriel},
journal={Distill},
volume={2},
number={4},
pages={e6},
year={2017},
url={https://distill.pub/2017/momentum/}
}
@article{nemirovsky1992information,
title={Information-based complexity of linear operator equations},
author={Nemirovsky, Arkadi S},
journal={Journal of Complexity},
volume={8},
number={2},
pages={153--175},
year={1992},
publisher={Academic Press}
}
@article{van1985asymptotic,
title={Asymptotic properties of orthogonal polynomials from their recurrence formula, I},
author={Van Assche, Walter},
journal={Journal of approximation theory},
volume={44},
number={3},
pages={258--276},
year={1985},
publisher={Elsevier},
url={https://doi.org/10.1016/0021-9045(85)90097-8}
}
@inproceedings{loshchilov2016sgdr,
title={SGDR: stochastic gradient descent with warm restarts},
author={Loshchilov, Ilya and Hutter, Frank},
booktitle={International Conference on Learning Representations (ICLR)},
year={2017},
url={https://arxiv.org/pdf/1608.03983.pdf}
}
@article{chihara1968orthogonal,
title={Orthogonal polynomials whose zeros are dense in intervals},
author={Chihara, TS},
journal={Journal of Mathematical Analysis and Applications},
volume={24},
number={2},
pages={362--371},
year={1968},
url={https://www.sciencedirect.com/science/article/pii/0022247X68900371/pdf?md5=a477421f6abb2fdeccd8b52968761213&pid=1-s2.0-0022247X68900371-main.pdf},
publisher={Elsevier}
}
@inproceedings{smith2017cyclical,
title={Cyclical learning rates for training neural networks},
author={Smith, Leslie N},
booktitle={2017 IEEE Winter Conference on Applications of Computer Vision (WACV)},
year={2017},
organization={IEEE},
url={https://arxiv.org/pdf/1506.01186.pdf}
}
@article{goujaud2021super,
title={Super-Acceleration with Cyclical Step-sizes},
author={Goujaud, Baptiste and Scieur, Damien and Dieuleveut, Aymeric and Taylor, Adrien and Pedregosa, Fabian},
journal={Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
year={2022},
url={https://arxiv.org/pdf/2106.09687.pdf}
}
@article{oymak2021provable,
title={Provable Super-Convergence with a Large Cyclical Learning Rate},
author={Oymak, Samet},
journal={IEEE Signal Processing Letters},
year={2021},
publisher={IEEE},
url={https://arxiv.org/pdf/2102.10734.pdf}
}
@article{fu2019cyclical,
title={Cyclical annealing schedule: A simple approach to mitigating KL vanishing},
author={Fu, Hao and Li, Chunyuan and Liu, Xiaodong and Gao, Jianfeng and Celikyilmaz, Asli and Carin, Lawrence},
journal={arXiv preprint arXiv:1903.10145},
url={https://arxiv.org/pdf/1903.10145.pdf},
year={2019}
}
@article{agarwal2021acceleration,
title={Acceleration via Fractal Learning Rate Schedules},
author={Agarwal, Naman and Goel, Surbhi and Zhang, Cyril},
journal={Proceedings of the 38th International Conference on Machine Learning},
url={https://arxiv.org/pdf/2103.01338.pdf},
year={2021}
}
@article{stieltjes1894recherches,
title={Recherches sur les fractions continues},
author={Stieltjes, T-J},
journal={Annales de la Faculté des sciences de Toulouse: Mathématiques},
year={1894},
url={https://afst.centre-mersenne.org/article/AFST_1894_1_8_4_J1_0.pdf}
}
@article{chebyshev1900,
author = {P. Tchebycheff},
title = {Sur la représentation des valeurs limites des intégrales par des résidus intégraux},
journal = {Acta Mathematica},
year = {1900},
url = {https://doi.org/10.1007/BF02406728}
}
@article{valent1995impact,
title={The impact of Stieltjes' work on continued fractions and orthogonal polynomials: additional material},
author={Valent, Galliano and Van Assche, Walter},
journal={Journal of Computational and Applied Mathematics},
volume={65},
number={1-3},
pages={419--447},
year={1995},
publisher={Elsevier}
}
@article{markoff1916polynome,
title={Über Polynome, die in einem gegebenen Intervalle möglichst wenig von Null abweichen},
author={Markov, Vladimir},
journal={Mathematische Annalen},
year={1916},
openaccess={https://sci-hub.se/10.1007/BF01456902},
url={https://doi.org/10.1007/BF01456902}
}
@article{zhang2019cyclical,
title={Cyclical stochastic gradient MCMC for Bayesian deep learning},
author={Zhang, Ruqi and Li, Chunyuan and Zhang, Jianyi and Chen, Changyou and Wilson, Andrew Gordon},
journal={ICLR 2020},
year={2019},
url={https://iclr.cc/virtual_2020/poster_rkeS1RVtPS.html}
}
@article{flanders1950numerical,
title={Numerical determination of fundamental modes},
author={Flanders, Donald A and Shortley, George},
journal={Journal of Applied Physics},
year={1950},
url={https://doi.org/10.1063/1.1699598},
}
@article{papyan2020traces,
title={Traces of class/cross-class structure pervade deep learning spectra},
author={Papyan, Vardan},
journal={Journal of Machine Learning Research},
pages={1--64},
year={2020},
url={https://arxiv.org/pdf/2008.11865.pdf}
}
@inproceedings{ghorbani2019investigation,
title={An investigation into neural net optimization via hessian eigenvalue density},
author={Ghorbani, Behrooz and Krishnan, Shankar and Xiao, Ying},
booktitle={International Conference on Machine Learning},
pages={2232--2241},
year={2019},
organization={PMLR},
url={https://arxiv.org/pdf/1901.10159.pdf}
}
@article{stahl1990nth,
title={Nth Root Asymptotic Behavior of Orthonormal Polynomials},
author={Stahl, Herbert and Totik, Vilmos},
journal={Orthogonal polynomials},
year={1990},
publisher={Springer},
url={https://link.springer.com/chapter/10.1007/978-94-009-0501-6_18}
}
@article{sagun2016eigenvalues,
title={Eigenvalues of the hessian in deep learning: Singularity and beyond},
author={Sagun, Levent and Bottou, Leon and LeCun, Yann},
journal={arXiv preprint arXiv:1611.07476},
url={https://arxiv.org/pdf/1611.07476.pdf},
year={2016}
}
@article{sagun2017empirical,
title={Empirical analysis of the hessian of over-parametrized neural networks},
author={Sagun, Levent and Evci, Utku and Guney, V Ugur and Dauphin, Yann and Bottou, Leon},
journal={arXiv preprint arXiv:1706.04454},
url={https://arxiv.org/pdf/1706.04454.pdf},
year={2017}
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
document.addEventListener('DOMContentLoaded', doTOC, false);
</script>
<div id="TOC"></div>
<div style="display: none">
$$
\def\aa{\boldsymbol a}
\def\rr{\boldsymbol r}
\def\AA{\boldsymbol A}
\def\HH{\boldsymbol H}
\def\EE{\mathbb E}
\def\II{\boldsymbol I}
\def\CC{\boldsymbol C}
\def\DD{\boldsymbol D}
\def\KK{\boldsymbol K}
\def\eeps{\boldsymbol \varepsilon}
\def\tr{\text{tr}}
\def\LLambda{\boldsymbol \Lambda}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\qq{\boldsymbol q}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\lmax{L}
\def\lmin{\mu}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\QQ{\boldsymbol Q}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\DeclareMathOperator{\span}{\mathbf{span}}
\def\defas{\stackrel{\text{def}}{=}}
\def\dif{\mathop{}\!\mathrm{d}}
\definecolor{colormomentum}{RGB}{27, 158, 119}
\definecolor{colorstepsize1}{RGB}{217,95,2}
\definecolor{colorstepsize2}{RGB}{117,112,179}
\definecolor{colorstepsize}{RGB}{152,78,163}
\def\mom{{\color{colormomentum}m}}
\def\stepzero{{\color{colorstepsize1}h_0}}
\def\stepone{{\color{colorstepsize2}h_1}}
\def\tildestepzero{{\color{colorstepsize1}\tilde{h}_0}}
\def\tildestepone{{\color{colorstepsize2}\tilde{h}_1}}
\def\momt{{\color{colormomentum}m_t}}
\def\ht{{\color{colorstepsize}h_t}}
\def\stept{{\color{colorstepsize}h_t}}
\def\step{{\color{colorstepsize}h}}
\definecolor{colorexternaleigenvalues}{RGB}{152, 78, 163}
\definecolor{colorinternaleigenvalues}{RGB}{77, 175, 74}
\def\muone{{\color{colorexternaleigenvalues}{\mu}_1}}
\def\Lone{{\color{colorinternaleigenvalues}{L}_1}}
\def\mutwo{{\color{colorinternaleigenvalues}{\mu}_2}}
\def\Ltwo{{\color{colorexternaleigenvalues}{L}_2}}
$$
</div>
<h2>Cyclical Heavy Ball</h2>
<p>
The most common variant of stochastic gradient descent uses a constant or decreasing step-size. However, some recent works advocate instead for <i>cyclical</i> step-sizes, where the step-size alternates between two or more values.<dt-cite key="loshchilov2016sgdr"></dt-cite> <dt-cite key="smith2017cyclical"></dt-cite> In the blink of an eye, these methods have gone from esoteric to ubiquitous: they are now available in all major deep learning toolboxes,<dt-note>See for example <a href="https://www.tensorflow.org/addons/api_docs/python/tfa/optimizers/CyclicalLearningRate">TensorFlow</a>, <a href="https://flax.readthedocs.io/en/latest/howtos/lr_schedule.html">Flax</a> and <a href="https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CyclicLR.html">PyTorch</a> implementations.</dt-note> and found applications beyond optimization in fields like probabilistic modeling.<dt-cite key="fu2019cyclical"></dt-cite> <dt-cite key="zhang2019cyclical"></dt-cite>
</p>
<p>
The scheme that we will consider has two step-sizes $\stepzero, \stepone$, so that odd iterations use one step-size while even iterations use a different one.<dt-note>There are subtle differences between the various works published so far. For example, in the original work <a href="https://arxiv.org/pdf/1506.01186.pdf">(Smith, 2017)</a>, the step-sizes take a sequence of values between $\stepzero$ and $\stepone$. The variant that we analyze takes instead two discrete step-sizes. </dt-note> The algorithm starts with an initial guess $\xx_0$ and produces successive approximations $\xx_1, \xx_2, \ldots$ to the minimizer of $f$ as follows:
</p>
<p class="framed">
<b class="tufte-underline">Cyclical Heavy Ball</b><br>
<b>Input</b>: starting guess $\xx_0$, step-sizes $0 \leq \stepzero \leq \stepone$ and momentum ${\color{colormomentum} m} \in [0, 1)$.<br>
$\xx_1 = \xx_0 - \dfrac{\stepzero}{1 + {\color{colormomentum} m}} \nabla f(\xx_0)$ <br>
<b>For</b> $t=1, 2, \ldots$ <br>
\begin{align*}\label{eq:momentum_update}
&\text{${\color{colorstepsize}h_t} = \stepzero$ if $t$ is even and ${\color{colorstepsize}h_t} = \stepone$ otherwise}\\
&\xx_{t+1} = \xx_t - {\color{colorstepsize}h_t} \nabla
f(\xx_t)+ {\color{colormomentum} m}(\xx_{t} - \xx_{t-1})
\end{align*}
</p>
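<p>
As a concrete companion to the pseudocode, here is a direct NumPy implementation (a sketch; the function name and argument conventions are ours):
</p>

```python
import numpy as np

def cyclical_heavy_ball(grad, x0, h0, h1, m, n_iter=200):
    """Heavy ball whose step-size alternates between h0 (t even) and h1 (t odd)."""
    x_prev = x0
    # first iterate, as in the pseudocode
    x = x0 - (h0 / (1 + m)) * grad(x0)
    for t in range(1, n_iter):
        h = h0 if t % 2 == 0 else h1
        x, x_prev = x - h * grad(x) + m * (x - x_prev), x
    return x
```

<p>
For instance, on the quadratic $f(\xx) = \frac{1}{2}\xx^\top \HH \xx - \bb^\top \xx$ the gradient is $\HH\xx - \bb$, and for suitable parameters the iterates converge to the minimizer $\HH^{-1}\bb$.
</p>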
<p>
By alternating between a small and a large step-size, the cyclical step-size algorithm has very different dynamics than the classical constant step-size variant:
</p>
<figure>
<span class="marginnote"> <br><br>In both cases the parameters are set to those that maximize the asymptotic worst-case rate, assuming an imperfect lower and upper bound on the Hessian eigenvalues. <br><br>
<a href="https://colab.research.google.com/github/google-research/google-research/blob/master/sobolev/examples/cyclical_learning_rates.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</span>
<img style="background-color: #fffff8;" src="/images/2021/comparison_cyclical.png" alt="">
</figure>
<p>
In this post, we'll show that the cyclical heavy ball method not only can match the convergence rate of <q>optimal</q> first-order methods such as Polyak and Nesterov momentum, but can in fact achieve a <i>faster</i> convergence in some scenarios.<dt-note>Wait, didn't I just say that Polyak heavy ball is optimal? How can you beat an optimal method? Hold on for a bit longer and I'll explain this apparent contradiction.</dt-note> We'll even be able to quantify this speedup in terms of quantities such as the <i>eigengap</i>, a gap in the Hessian's eigenvalues.
</p>
<p>
As in previous posts, the theory of orthogonal polynomials will be our best ally throughout the journey. We'll start by expressing the residual polynomial of the cyclical heavy ball method as a convex combination of Chebyshev polynomials of the first and second kind (<a href="#sec2">Section 2</a>), and then we'll use properties of Chebyshev polynomials to derive the worst-case optimal set of parameters (<a href="#sec3">Section 3</a>).
</p>
<p>
This post is based on a recent paper with the incredible <a href="https://scholar.google.com/citations?user=93PAG2AAAAAJ&hl=en">Baptiste Goujaud</a>, <a href="https://damienscieur.com/">Damien Scieur</a>, <a href="http://www.cmap.polytechnique.fr/~aymeric.dieuleveut/">Aymeric Dieuleveut</a>, and <a href="https://www.di.ens.fr/~ataylor/">Adrien Taylor</a>.<dt-cite key="goujaud2021super"></dt-cite>
The blog post only covers a subset of the results in the paper. In particular, the paper also has results on longer cycles and local convergence for non-quadratic objectives which are not covered in this blog post.
</p>
<h3>Related work</h3>
<p>
Closely related to our paper is the recent work of Oymak.<dt-cite key="oymak2021provable"></dt-cite> There, the author derives convergence rates for gradient descent (without momentum) using a scheduler that takes <q>one large, $K-1$ small</q> steps instead of the alternating scheme that's the subject of this post. He showed improved convergence rates for that scheme under the same eigengap assumption we'll describe later. One important difference is that (Oymak 2021) doesn't achieve the optimal rate on general strongly convex quadratics (see Appendix E.5 <dt-cite key="agarwal2021acceleration"></dt-cite>). In contrast, the method we'll describe here defaults to Polyak heavy ball and achieves the optimal rate on general quadratics.
</p>
<h2>Chebyshev Meets Cyclical Learning Rates</h2>
<p>
As in previous blog posts in this polynomials series, we'll assume our objective function $f$ is a quadratic of the form
\begin{equation}\label{eq:opt}
f(\xx) \defas \frac{1}{2}\xx^\top \HH \xx + \bb^\top \xx + c~,
\end{equation}
where $\HH$ is a $d\times d$ positive semi-definite matrix with eigenvalues in some set $\Lambda$.
</p>
<p>
As we saw in Part 2 of this series, with any gradient-based method we can associate a residual polynomial $P_t$ that determines its convergence behavior.<dt-note>The notion of residual polynomial was introduced in <a href="http://fa.bianp.net/blog/2020/polyopt/#sec2">Part 1</a> of this series.</dt-note> This association results in the following inequality, which relates the error at iteration $t$ to the maximum value of the polynomial over the Hessian eigenvalue support set $\Lambda$:
\begin{equation}
\|\xx_t - \xx^\star\| \leq {\color{purple}\underbrace{\vphantom{\max_{[L]}}\max_{\lambda \in \Lambda
}}_{\text{conditioning}}} \,
{\color{teal}\underbrace{\vphantom{\max_{[L]}}|P_t(\lambda)|}_{\text{algorithm}} } \,\,
{\color{brown}\underbrace{\vphantom{\max_{[L]}}\|\xx_0 - \xx^\star\|}_{\text{initialization}}}~.
\end{equation}
This inequality highlights the importance of being able to bound the residual polynomial,<dt-note>We'll sometimes refer to ${\max_{\lambda \in \Lambda}|P_t(\lambda)|}$ as the <i>convergence rate</i> of the algorithm.</dt-note> as doing so results in a bound for the error norm $\|\xx_t - \xx^\star\|$, which is often the goal in the analysis of optimization algorithms. This and the next 2 sections are devoted to characterizing and bounding the residual polynomial of the cyclical heavy ball method.
</p>
<p>
Let $P_{t}$ denote the residual polynomial associated with the cyclical heavy ball method at iteration $t$. By our definition of the residual polynomial in <a href="http://fa.bianp.net/blog/2020/polyopt/#sec2">Part 1</a>, the sequence of residual polynomials admits the recurrence
\begin{equation}
\begin{aligned}
&P_{t+1}(\lambda) = (1 + \mom - {\color{colorstepsize}h_{\bmod(t, 2)}} \lambda) P_{t}(\lambda) - \mom P_{t-1}(\lambda)\\
&P_0(\lambda) = 1\,,\quad P_1(\lambda) = 1 - \frac{\stepzero}{1 + \mom}\lambda\,.
\end{aligned}\label{eq:residual_cyclical_recurrence}
\end{equation}
What makes this recurrence different from the others we've seen before is that the coefficients multiplying the previous polynomials depend on the iteration number $t$ through the step-size ${\color{colorstepsize}h_{\bmod(t, 2)}}$.
This is enough to break the previous proof technique.
</p>
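<p>
This recurrence is easy to check numerically. The sketch below (our own, using NumPy's polynomial arithmetic) builds the polynomials $P_t$ from \eqref{eq:residual_cyclical_recurrence} and records the errors of the actual iterates on a diagonal quadratic; by construction the two should satisfy $\xx_t - \xx^\star = P_t(\HH)(\xx_0 - \xx^\star)$:
</p>

```python
import numpy as np
from numpy.polynomial import Polynomial as Poly

h0, h1, m = 0.05, 0.15, 0.5
lam = Poly([0.0, 1.0])  # the indeterminate "lambda"

# residual polynomials P_0, P_1, ... from the recurrence
P = [Poly([1.0]), Poly([1.0]) - (h0 / (1 + m)) * lam]
for t in range(1, 12):
    h = h0 if t % 2 == 0 else h1
    P.append((1 + m - h * lam) * P[t] - m * P[t - 1])

# iterates of cyclical heavy ball on a diagonal quadratic H x = b
H_diag = np.array([1.0, 10.0])
b = np.array([1.0, 1.0])
x_star = b / H_diag
x_prev = np.zeros(2)
x = x_prev - (h0 / (1 + m)) * (H_diag * x_prev - b)
errors = [x_prev - x_star, x - x_star]
for t in range(1, 12):
    h = h0 if t % 2 == 0 else h1
    x, x_prev = x - h * (H_diag * x - b) + m * (x - x_prev), x
    errors.append(x - x_star)
```

<p>
Since the Hessian here is diagonal, $P_t(\HH)$ acts coordinate-wise as $P_t(\lambda_i)$, so each error vector equals $P_t(\lambda_i)$ times the corresponding coordinate of the initial error.
</p>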
<p>
Luckily, we can eliminate the dependency on the iteration number by <i>chaining</i> iterations in a way that the recurrence jumps two iterations instead of one.
Evaluating the previous equation at $P_{2t + 2}$ and solving for $P_{2t + 1}$ gives
\begin{equation}
P_{2t + 1}(\lambda) = \frac{P_{2t + 2}(\lambda) + \mom P_{2t}(\lambda)}{1 + \mom - \stepone\lambda}\,.
\end{equation}
Using this to replace $P_{2t+1}$ and $P_{2t-1}$ in the original recurrence \eqref{eq:residual_cyclical_recurrence} gives
\begin{equation}
\frac{P_{2t + 2}(\lambda) + \mom P_{2t}(\lambda)}{1 + \mom - \stepone\lambda} = (1 + \mom - \stepzero \lambda) P_{2t}(\lambda) - \mom \left(\frac{P_{2t}(\lambda) + \mom P_{2t-2}(\lambda)}{1 + \mom - \stepone\lambda} \right)\,,
\end{equation}
which, after solving for $P_{2t + 2}$, finally gives
\begin{equation}\label{eq:recursive_formula}
P_{2t + 2}(\lambda) = \left( (1 + \mom - \stepzero\lambda)(1 + \mom - \stepone \lambda) - 2 \mom \right) P_{2t}(\lambda) - \mom ^2 P_{2t - 2}(\lambda)\,.
\end{equation}
This recurrence expresses $P_{2t+2}$ in terms of $P_{2t}$ and $P_{2t - 2}$. It's hence a recurrence for only <i>even</i> iterations, but it has one huge advantage with respect to the previous one: the coefficients that multiply the previous polynomials now don't depend on the iteration number.
</p>
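<p>
A quick numerical sanity check of this two-step recurrence (our own sketch): build the polynomials from the original recurrence \eqref{eq:residual_cyclical_recurrence} and verify that the even-degree ones satisfy \eqref{eq:recursive_formula}:
</p>

```python
import numpy as np
from numpy.polynomial import Polynomial as Poly

h0, h1, m = 0.05, 0.15, 0.5
lam = Poly([0.0, 1.0])

# residual polynomials from the one-step (iteration-dependent) recurrence
P = [Poly([1.0]), Poly([1.0]) - (h0 / (1 + m)) * lam]
for t in range(1, 13):
    h = h0 if t % 2 == 0 else h1
    P.append((1 + m - h * lam) * P[t] - m * P[t - 1])

# iteration-independent coefficient of the chained (even-iteration) recurrence
a = (1 + m - h0 * lam) * (1 + m - h1 * lam) - 2 * m
```

<p>
Both sides of \eqref{eq:recursive_formula} are polynomials in $\lambda$, so it suffices to compare them on a handful of sample points.
</p>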
<p>
Once we have this recurrence, following a similar proof to the one we did in <a href="https://fa.bianp.net/blog/2020/momentum/#sec3">Part 2</a>, we can derive an expression for the residual polynomial $P_{2t}$ based on Chebyshev polynomials of the first and second kind. This is useful because it will later make it easy to bound this polynomial –and develop convergence rates– using known bounds for Chebyshev polynomials.
</p>
<p class="theorem framed" id="theorem-goujaud" text="Goujaud et al. 2022"><dt-cite key="goujaud2021super"></dt-cite>
Consider the cyclical heavy ball method with step-sizes $\stepzero$, $\stepone$, and momentum parameter $\mom$.
Then the residual polynomial $P_{2t}$ of even degree associated with this method
can be written in terms of Chebyshev polynomials of the first ($T_{t}$) and second kind $(U_{t})$
as
\begin{align}\label{eq:theoremP2t}
P_{2t}(\lambda) &= \mom^t \left( \tfrac{2 \mom}{1 + \mom}\,
T_{2t}(\zeta(\lambda))
+ \tfrac{1 - \mom}{1 + \mom} \, U_{2t}(\zeta(\lambda))\right)\,,
\end{align}
with $\zeta(\lambda) = \frac{1+\mom}{2 \sqrt{\mom}}\sqrt{(1 - \tfrac{\stepzero}{1 + \mom}\lambda)(1 - \tfrac{\stepone}{1+\mom} \lambda) }$.<dt-note> Since $\zeta$ is evaluated only at even polynomials, the result doesn't depend on the branch of the square root function.</dt-note> <dt-note> Note $\zeta$ can be either a real or pure imaginary number. </dt-note>
</p>
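<p>
The theorem lends itself to a numerical verification. The sketch below (ours) compares the closed form \eqref{eq:theoremP2t} against the polynomials built from the recurrence \eqref{eq:residual_cyclical_recurrence}, at values of $\lambda$ where $\zeta(\lambda)$ is real; the Chebyshev polynomials are evaluated through their standard three-term recurrences:
</p>

```python
import numpy as np
from numpy.polynomial import Polynomial as Poly

h0, h1, m = 0.05, 0.15, 0.5
lam = Poly([0.0, 1.0])

# residual polynomials from the recurrence
P = [Poly([1.0]), Poly([1.0]) - (h0 / (1 + m)) * lam]
for t in range(1, 11):
    h = h0 if t % 2 == 0 else h1
    P.append((1 + m - h * lam) * P[t] - m * P[t - 1])

def cheb_TU(n, x):
    """T_n(x) and U_n(x) via the three-term recurrences of both kinds."""
    T, U = [1.0, x], [1.0, 2 * x]
    for k in range(1, n):
        T.append(2 * x * T[k] - T[k - 1])
        U.append(2 * x * U[k] - U[k - 1])
    return T[n], U[n]

def P2t_closed_form(t, l):
    """Right-hand side of the theorem, for lambda where zeta is real."""
    zeta = (1 + m) / (2 * np.sqrt(m)) * np.sqrt(
        (1 - h0 * l / (1 + m)) * (1 - h1 * l / (1 + m)))
    T, U = cheb_TU(2 * t, zeta)
    return m**t * (2 * m / (1 + m) * T + (1 - m) / (1 + m) * U)
```

<p>
With these parameters, $\zeta(\lambda)$ is real for $\lambda \leq (1+\mom)/\stepone = 10$, so comparing the two expressions on a few points in $[0, 10]$ exercises the formula, including points where $|\zeta| > 1$ and the Chebyshev polynomials grow.
</p>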
<div class="wrap-collabsible"> <input id="collapsible3" class="toggle" type="checkbox"> <label for="collapsible3" class="lbl-toggle" tabindex="0"><b>Show proof</b></label><div class="collapsible-content"><div class="content-inner"><div class="proof" id="proof-cyclical">
<p>
We will prove this by induction. We will use that Chebyshev polynomials of the first kind $T_t$ and of the second kind $U_t$ follow the recurrence
\begin{align}
&T_{2t + 2}(\lambda) = (4 \lambda^2 - 2) T_{2t}(\lambda) - T_{2t - 2}(\lambda) \label{eq:recursion_first_kind}\\
&U_{2t + 2}(\lambda) = (4 \lambda^2 - 2) U_{2t}(\lambda) - U_{2t - 2}(\lambda) \label{eq:recursion_second_kind}\,,
\end{align}
with initial conditions $T_{0}(\lambda) = 1\,,~T_2(\lambda) = 2 \lambda^2 - 1$ and $U_{0}(\lambda) = 1\,,~U_2(\lambda) = 4 \lambda^2 - 1$.
</p>
<p>
Let $\widetilde{P}_{2t}$ be the right-hand side of \eqref{eq:theoremP2t}.
We start with the case $t=0$ and $t=1$. From the initial conditions of Chebyshev polynomials we see that
\begin{align}
\widetilde{P}_0(\lambda) &= \small\dfrac{2\mom}{1 + \mom}
+ \small\dfrac{1 - \mom}{1 + \mom} = 1 = P_0(\lambda) \\
\widetilde{P}_2(\lambda) &= \mom \left( {\small\dfrac{2\mom}{1 + \mom}}
(2 \zeta(\lambda)^2 - 1)
+ {\small\dfrac{1 - \mom}{1 + \mom}} (4 \zeta(\lambda)^2 - 1)\right) \\
&= \mom \left( \dfrac{4}{1 + \mom}\zeta(\lambda)^2 - 1 \right) \\
&= 1 - (\stepzero + \stepone)\lambda +{\small\dfrac{\stepzero \stepone }{1+\mom}}\lambda^2 \,.
\end{align}
Let's now compute $P_2$ from its recursive formula \eqref{eq:residual_cyclical_recurrence}, which gives:
\begin{align}
{P}_2(\lambda) &= (1 + \mom - \stepone \lambda)P_1(\lambda) - \mom P_0(\lambda)\\
&= (1 + \mom - \stepone \lambda)(1 - {\small\dfrac{\stepzero \lambda}{1 + \mom}}) - \mom \\
&= 1 - (\stepzero + \stepone)\lambda +{\small\dfrac{\stepzero \stepone }{1+\mom}}\lambda^2 \,.
\end{align}
and so we can conclude that $\widetilde{P}_2 = P_2$.
</p>
<p>
Now let's assume the result holds up to $t$. For $t+1$ we have, by the recursive formula \eqref{eq:recursive_formula},
\begin{align}
P_{2t+2}(\lambda) &\overset{\text{Eq. \eqref{eq:recursive_formula}}}{=} && \left( 4 \mom \zeta(\lambda)^2 - 2 \mom \right) P_{2t}(\lambda) - \mom^2 P_{2t - 2}(\lambda)\\
&\overset{\text{induction}}{=} && \mom^{t+1}\left( 4 \zeta(\lambda)^2 - 2 \right) \left( {\small\dfrac{2 \mom}{\mom+1}}\, T_{2t}(\zeta(\lambda))
+ {\small\dfrac{1 - \mom}{1 + \mom}}\, U_{2t}(\zeta(\lambda))\right)\\
& &-& \mom^{t+1} \left( {\small\frac{2 \mom}{\mom+1}}\,
T_{2(t-1)}(\zeta(\lambda))
+ {\small\frac{1 - \mom}{1 + \mom}}\, U_{2(t-1)}(\zeta(\lambda))\right)\\
&\overset{\text{Eq. (\ref{eq:recursion_first_kind}, \ref{eq:recursion_second_kind}})}{=} && \mom^{t+1}\left( {\small\frac{2 \mom}{\mom+1}}\, T_{2(t+1)}(\zeta(\lambda)) + {\small\frac{1 - \mom}{1 +
\mom}}\, U_{2(t+1)}(\zeta(\lambda))\right)
\end{align}
where in the second identity we used the induction hypothesis and in the third one the recurrence of even Chebyshev
polynomials.
This concludes the proof.
</p>
</div></div></div></div>
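<p>
For the skeptical reader, the theorem can also be checked numerically. The following sketch (with made-up values for the momentum and step-sizes) compares the residual polynomial computed through its two-term recurrence against the Chebyshev expression above:
</p>

```python
import cmath

# Sanity check of the theorem with made-up momentum and step-sizes.
m, h0, h1 = 0.5, 0.3, 0.7

def P_recurrence(lam, n):
    """P_n(lam) via the heavy ball recurrence
    P_{t+1} = (1 + m - h_t*lam) P_t - m P_{t-1}, with alternating step-sizes."""
    P_prev, P = 1.0, 1.0 - h0 * lam / (1.0 + m)
    for t in range(1, n):
        h = h0 if t % 2 == 0 else h1
        P_prev, P = P, (1 + m - h * lam) * P - m * P_prev
    return P

def P_chebyshev(lam, t):
    """P_{2t}(lam) through the Chebyshev expression of the theorem."""
    zeta = (1 + m) / (2 * cmath.sqrt(m)) * cmath.sqrt(
        (1 - h0 * lam / (1 + m)) * (1 - h1 * lam / (1 + m)))
    T_prev, T = 1.0, zeta        # T_0, T_1
    U_prev, U = 1.0, 2 * zeta    # U_0, U_1
    for _ in range(2 * t - 1):   # advance both recurrences to degree 2t
        T_prev, T = T, 2 * zeta * T - T_prev
        U_prev, U = U, 2 * zeta * U - U_prev
    return m**t * (2 * m / (1 + m) * T + (1 - m) / (1 + m) * U)

for lam in [0.2, 1.3, 2.5]:
    for t in [1, 2, 4]:
        assert abs(P_recurrence(lam, 2 * t) - P_chebyshev(lam, t)) < 1e-9
```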
<h2>It's All About the Link Function</h2>
<p>
In part 2 of this series of blog posts, we derived the residual polynomial $P^{\text{HB}}_{2t}$ of (constant step-size) heavy ball. We expressed this polynomial in terms of Chebyshev polynomials as
\begin{equation}
P^{\text{HB}}_{2t}(\lambda) = {\color{colormomentum}m}^{t} \left( {\small\frac{2
{\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\,
T_{2t}(\sigma(\lambda))
+ {\small\frac{1 - {\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\,
U_{2t}(\sigma(\lambda))\right)\,,
\end{equation}
with $\sigma(\lambda) \defas {\small\dfrac{1}{2\sqrt{{\color{colormomentum}m}}}}(1 +
{\color{colormomentum}m} -
{\color{colorstepsize} h}\,\lambda)$.
It turns out that the above is <i>essentially</i> the same polynomial as the one associated with the cyclical heavy ball in the previous theorem \eqref{eq:theoremP2t}, the only difference being the term inside the Chebyshev polynomials (denoted $\sigma$ for the constant step-size and $\zeta$ for the cyclical step-size variant). We'll call these <q>link functions</q>.
</p>
<p>
In the case of constant step-sizes, the link is the linear function $\sigma(\lambda)$, while for cyclical step-sizes it is the more complicated function $\zeta(\lambda)$:
\begin{align}
\sigma(\lambda) &= {\small\dfrac{1}{2\sqrt{{\color{colormomentum}m}}}}(1 +
{\color{colormomentum}m} -
{\color{colorstepsize} h}\,\lambda) & \text{ (constant step-sizes)} \\
\zeta(\lambda) &= {\small\dfrac{1+\mom}{2 \sqrt{\mom}}}\sqrt{\big(1 - {\small\dfrac{\stepzero}{1 + \mom}}\lambda\big)\big(1 - {\small\dfrac{\stepone}{1+\mom}} \lambda\big) }
& \text{(cyclical step-sizes)}
\end{align}
Understanding the differences between these two link functions is key to understanding cyclical step sizes.
</p>
<p>
While the link function for the constant step-size heavy ball is always real-valued, the link function of the cyclical variant is complex-valued. Hence, to provide a meaningful bound on the residual polynomial we need to understand the behavior of Chebyshev polynomials in the complex plane.
</p>
<h3>The two faces of Chebyshev polynomials</h3>
<p>
It turns out that Chebyshev polynomials always grow exponentially in the degree outside of the real segment $[-1, 1]$. In other words, Chebyshev polynomials have two clear-cut regimes: linearly bounded in $[-1, 1]$, and exponentially diverging outside.
</p>
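<p>
This dichotomy is easy to see numerically. The quick sanity check below (not part of any proof) evaluates $T_t$ through its three-term recurrence and verifies both regimes:
</p>

```python
# A quick numerical illustration of the two regimes, using the
# three-term recurrence T_{t+1}(z) = 2 z T_t(z) - T_{t-1}(z).
def cheb_T(z, t):
    T_prev, T = 1.0, z
    for _ in range(t - 1):
        T_prev, T = T, 2 * z * T - T_prev
    return T

# Bounded by 1 on [-1, 1], even at degree 50 ...
assert all(abs(cheb_T(-1 + 0.02 * k, 50)) <= 1 + 1e-9 for k in range(101))
# ... but growing geometrically outside: T_t(1.1) = cosh(t * arccosh(1.1)),
# so each extra degree multiplies the value by roughly e^{arccosh(1.1)} ~ 1.56.
ratios = [cheb_T(1.1, t + 1) / cheb_T(1.1, t) for t in range(10, 20)]
assert min(ratios) > 1.5
```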
<figure class="fullwidth">
<img src="/images/2022/complex_chebyshev.png" alt="">
</figure>
<p><span class="marginnote">
Magnitude of Chebyshev polynomials of the first kind in the complex plane. $x$ and $y$ coordinates represent the (complex) input, magnitude of the output is displayed on the $z$ axis, while the argument is represented by the color. <br><br>
Zeros of Chebyshev polynomials are concentrated in the real $[-1, 1]$ segment, and the polynomials diverge exponentially fast outside it.
<a href="/uploads/2022/chebyshev_complex.nb"><img src="/images/2022/mathematica.svg" alt=""></a>
</span></p>
<p>
The following lemma formalizes this for the convex combination of Chebyshev polynomials $\frac{2{\color{colormomentum}m}}{1 + {\color{colormomentum}m}}\, T_{t}(z)
+ {\small\frac{1 - {\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\, U_{t}(z)
$ that appear in the heavy ball method:
</p>
<p class="lemma framed" id="lemma-Chebyshev">
Let $z$ be any complex number. The sequence
$
\left(\big|
\small\frac{2{\color{colormomentum}m}}{1 + {\color{colormomentum}m}}\, T_{t}(z)
+ {\small\frac{1 - {\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\, U_{t}(z)
\big|\right)_{t \geq 0}
$
grows exponentially in $t$ for $z \notin [-1, 1]$, while inside that interval the polynomials are bounded as $|T_t(z)| \leq 1$ and $|U_t(z)| \leq t+1$.
</p>
<div class="wrap-collabsible"> <input id="collapsible4" class="toggle" type="checkbox"> <label for="collapsible4" class="lbl-toggle" tabindex="0"><b>Show proof</b></label><div class="collapsible-content"><div class="content-inner"><div class="proof" id="proof-Chebyshev">
<p>
For this proof we'll use the following explicit expression for Chebyshev polynomials, valid for any complex number $z$ other than 1 and -1: <dt-note>Although we won't use it in this post, these asymptotic bounds hold for a large class of orthogonal polynomials, see next reference for instance.</dt-note> <dt-cite key="stahl1990nth"></dt-cite>
\begin{align}
T_t(z) = &~ \frac{\xi^t + \xi^{-t}}{2} \nonumber \\
U_t(z) = &~ \frac{\xi^{t+1} - \xi^{-(t+1)}}{\xi - \xi^{-1}} \label{eq:t_and_u}\,.
\end{align}
with $\xi \defas z + \sqrt{z^2 - 1}$, which implies $\xi^{-1} = z - \sqrt{z^2 - 1}$.<dt-note>Here and in the previous equation $\sqrt{\cdot}$ can denote any branch of the square root function. By symmetry, the output will be the same for any branch we choose.</dt-note>
By combining the 2 previous expressions of the polynomials $T_t$ and $U_t$, we obtain
\begin{equation}\label{eq:chebyshevs_xi}
\small\frac{2{\color{colormomentum}m}}{1 + {\color{colormomentum}m}}\, T_{t}(z)
+ {\small\frac{1 - {\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\, U_{t}(z)
=
\frac{(\xi^2 - \mom)\xi^t + (\mom \xi^2 - 1)\xi^{-t}}{(1+\mom)(\xi^2-1)}\,.
\end{equation}
</p>
<p>
This expression has 2 exponential terms. Depending on the magnitude of $\xi$, we have the following behavior:
<dt-note> Note that Equation \eqref{eq:chebyshevs_xi} is not well defined for $\xi = \pm 1$, which corresponds exactly to $z = \pm 1$, values already excluded from Equation \eqref{eq:t_and_u} and treated separately below.</dt-note>
</p>
<ul>
<li>If $|\xi|\gt 1$, then the first term grows exponentially fast.</li>
<li>If $|\xi|\lt 1$, then the second term grows exponentially fast.</li>
<li>If $|\xi|=1$ and $\xi \neq \pm 1$, then the sequence of interest is bounded.</li>
</ul>
<p>
Therefore,
$
\left(\big|
\small\frac{2{\color{colormomentum}m}}{1 + {\color{colormomentum}m}}\, T_{t}(z)
+ {\small\frac{1 - {\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\, U_{t}(z)
\big|\right)_{t \geq 0}
$
is bounded only when $|\xi|=1$ and grows exponentially fast otherwise. We're still left with the task of finding the complex numbers $z$ for which $\xi$ has unit modulus.
</p>
<p>
For this, note that when $\xi$ has modulus 1, its inverse $z - \sqrt{z^2 - 1}$ is also its complex conjugate.
This implies that the real part of $\xi$ and $\xi^{-1}$ coincide, and so $\sqrt{z^2 - 1}$ is purely imaginary and $z \in (-1, 1)$.
Finally, for the case $z \in \{ -1, +1\}$ which we excluded at the beginning of the proof, we can use the known bounds $|T_t(\pm 1)| \leq 1$, $|U_t(\pm 1)| \leq (t+1)$ to conclude that it does not grow exponentially.
</p>
</div></div></div></div>
<h3>A closer look into the link function</h3>
<p>
Given the previous lemma that shows that our combination of Chebyshev polynomials grows exponentially in the degree outside of $[-1, 1]$, we seek to understand when the image of the link function lies in this interval
since it's in this regime where we'll expect to achieve the best rates. We'll call the set of parameters that verify this condition the <span class="tufte-underline">robust region</span>.<dt-note>We explored this region in <a href="/blog/2020/momentum/#sec4">Part 2</a> of this series in the context of the classical heavy ball method. The name is in reference to the step-size stability of the method: inside this region, the rate of convergence is <i>independent of the step-sizes</i> and only depends on the momentum term. </dt-note>
</p>
<p>
The link function $\sigma$ for the constant step-size heavy ball is a linear function, so its preimage $\sigma^{-1}([-1, 1])$ is also an interval. This explains why most analyses of heavy ball assume that the eigenvalues belong to a single interval.
</p>
<figure>
<img style="max-width: 500px; background-color: #fffff8;" src="/images/2022/link_function_constant.png" alt="">
</figure>
<p>
While the preimage $\sigma^{-1}([-1, 1])$ associated with the constant step-size link function is always an interval,
things become more interesting for the cyclical link function $\zeta$. In this case, the preimage is not always an interval:
</p>
<figure>
<span class="marginnote">
The preimage $\zeta^{-1}([-1, 1])$ of the cyclical step-size link function is a union of two disjoint intervals. $\zeta(\lambda)$ is larger than 1 to the left of the left interval
and smaller than -1 to the right of the right interval. In between both intervals,
$\zeta(\lambda)$ is purely imaginary.
<br><br> For visualization purposes, we choose the branch of the square root that agrees in sign with $(1 + \mom - \stepone \lambda)$. This doesn't affect the rate, since this term only appears squared in the residual polynomial expression. <br><br>
<a href="https://colab.research.google.com/github/google-research/google-research/blob/master/sobolev/examples/cyclical_learning_rates.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</span>
<img style="max-width: 500px; background-color: #fffff8;" src="/images/2022/link_function_cyclical.png" alt="">
</figure>
<p>
The preimage $\zeta^{-1}([-1, 1])$ associated with the cyclical step-size link function is always a union of two intervals.
Chihara<dt-cite key="chihara1968orthogonal"></dt-cite> <dt-note><img style="display: block; margin: 0 auto; max-width: 200px; box-shadow: 6px 6px 3px grey;" src="/images/2021/Chihara.jpg" alt=""> <br> Theodore Seio Chihara is known for his research in the field of orthogonal polynomials and
special functions. He wrote the very influential book <a href="https://books.google.ca/books/about/An_Introduction_to_Orthogonal_Polynomial.html?id=71CVAwAAQBAJ&redir_esc=y">An Introduction to Orthogonal Polynomials</a>.
The family of <a href="https://en.wikipedia.org/wiki/Al-Salam%E2%80%93Chihara_polynomials">Al-Salam-Chihara</a> and <a href="https://en.wikipedia.org/wiki/Brenke%E2%80%93Chihara_polynomials">Brenke-Chihara</a> polynomials are named in his honor.</dt-note> worked out the exact formula for this preimage, which turns out to be a <i>union</i> of two intervals of the same size:
\begin{equation}
\begin{aligned}
\zeta^{-1}([-1, 1]) = & \left[\frac{(1+\mom)(\stepzero + \stepone)-\Delta}{2 \stepzero \stepone}, \frac{1+\mom}{\stepone}\right]\\
& \bigcup \left[\frac{1+\mom}{\stepzero}, \frac{(1+\mom)(\stepzero + \stepone)+\Delta}{2 \stepzero \stepone}\right]\,,
\end{aligned}\label{eq:preimage}
\end{equation}
with $\Delta \defas \sqrt{(1+\mom)^2 (\stepzero + \stepone)^2- 4 (1 - \mom)^2 \stepzero \stepone}$.
</p>
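<p>
As a sanity check of Chihara's formula, the snippet below (with arbitrary illustrative parameters satisfying $\stepone > \stepzero$, as holds for the optimal step-sizes derived later) verifies that $|\zeta| = 1$ at the outer endpoints of the preimage, that $\zeta$ vanishes at the inner ones, and that $\zeta$ is purely imaginary in between:
</p>

```python
import cmath, math

# Illustrative parameters (h1 > h0, as for the optimal step-sizes).
m, h0, h1 = 0.4, 0.5, 0.9

def zeta(lam):
    """The cyclical step-size link function."""
    return (1 + m) / (2 * math.sqrt(m)) * cmath.sqrt(
        (1 - h0 * lam / (1 + m)) * (1 - h1 * lam / (1 + m)))

# Endpoints of the preimage, from Chihara's formula.
delta = math.sqrt((1 + m)**2 * (h0 + h1)**2 - 4 * (1 - m)**2 * h0 * h1)
outer_left = ((1 + m) * (h0 + h1) - delta) / (2 * h0 * h1)
inner_left, inner_right = (1 + m) / h1, (1 + m) / h0
outer_right = ((1 + m) * (h0 + h1) + delta) / (2 * h0 * h1)

# |zeta| = 1 at the outer endpoints, zeta = 0 at the inner ones.
assert abs(abs(zeta(outer_left)) - 1) < 1e-9
assert abs(abs(zeta(outer_right)) - 1) < 1e-9
assert abs(zeta(inner_left)) < 1e-6 and abs(zeta(inner_right)) < 1e-6
# Between the two intervals zeta is purely imaginary.
mid = 0.5 * (inner_left + inner_right)
assert abs(zeta(mid).real) < 1e-12 and zeta(mid).imag > 0
```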
<p>
<span class="marginnote">
<img style="background-color: #fffff8;" src="/images/2022/CyclicalResidualPolynomial.png" alt=""> <br>
One of the main differences between polynomials with constant and cyclical coefficients is the behavior of their zeros. While the zeros of orthogonal polynomials with constant coefficients concentrate in an interval, those of orthogonal polynomials with varying coefficients concentrate instead on a <i>union</i> of two disjoint intervals.
</span>
This result has profound implications. First, remember that because of the exponential divergence of Chebyshev polynomials outside the $[-1, 1]$ interval, we'd like the image of the Hessian eigenvalues to be in this $[-1, 1]$ interval. The previous result says that the eigenvalue support set that can guarantee this is no longer a single interval (as was the case for the constant step-size heavy ball) but a <i>union</i> of two disjoint intervals. We'll explore the implications of this model in the next section.
</p>
<h2>A Refined Model of the Eigenvalue Support Set</h2>
<p>
In the following we'll assume that the eigenvalues of $\HH$ lie in a set $\Lambda$ that is a <i>union of two intervals</i> of equal size:
\begin{align}
\Lambda \defas [\muone, \Lone] \cup [\mutwo, \Ltwo]\,,
\end{align}
with $\Lone - \muone = \Ltwo - \mutwo$. Note that this parametrization is strictly more general than the classical assumption that the eigenvalues lie within an interval since we can always take $\Lone = \mutwo$, in which case $\Lambda$ becomes the $[\muone, \Ltwo]$ interval.
</p>
<p>
A quantity that will be crucial later on is the <i>relative gap</i> or eigengap:
\begin{equation}
R \defas \frac{\mutwo - \Lone}{\Ltwo - \muone}\,.
\end{equation}
This quantity measures the size of the gap between the two intervals. $R=0$ implies that the two intervals are adjacent ($\Lone = \mutwo$ and the set $\Lambda$ is in practice a single interval), while $R=1$ implies instead that all eigenvalues are concentrated around the largest and smallest value.
</p>
<p>
It turns out that the Hessians of many deep neural networks fit this model with a non-trivial gap.<dt-cite key="sagun2016eigenvalues"></dt-cite> <dt-cite key="sagun2017empirical"></dt-cite> <dt-cite key="papyan2020traces"> </dt-cite> <dt-cite key="ghorbani2019investigation"></dt-cite> And we don't have to look far: even a quadratic objective on the venerable MNIST dataset exhibits this behavior, with an outlier eigenvalue leading to a non-trivial relative gap:
</p>
<figure>
<span class="marginnote"> <br><br>Hessian eigenvalue histogram for a quadratic
objective on MNIST. The outlier eigenvalue at $\Ltwo$ generates a non-zero relative gap R = 0.77. In this case,
the 2-cycle heavy ball method has a faster asymptotic
rate than the single-cycle one (see Section 3.2)
<br><br>
<a href="https://colab.research.google.com/github/google-research/google-research/blob/master/sobolev/examples/cyclical_learning_rates.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</span>
<img style="max-width: 500px; background-color: #fffff8;" src="/images/2022/spectrum_mnist.png" alt="">
</figure>
<p>
More complex eigenvalue structures can likewise be modeled using longer cycle lengths. While we won't go into these techniques in this blog post, some are discussed in <a href="https://arxiv.org/pdf/2106.09687.pdf">the paper</a>.
</p>
<h2>A Growing Robust Region</h2>
<p>
The set of parameters for which the image of the link function is in the $[-1, 1]$ interval is called the <i>robust region</i>.<dt-note>We've discussed the robust region in detail in <a href="/blog/2021/hitchhiker/">Part 3</a> of this series. Two of the most important properties are $i)$ the optimal convergence rate is achieved in this region, and $ii)$ the step-size doesn't influence the convergence rate in this region, which is fully determined by the momentum parameter.</dt-note>
Using known bounds of Chebyshev polynomials $|T_t(z)| \leq 1$ and $|U_t(z)| \leq t+1$ for $z \in [-1, 1]$, we can compute the convergence rate in this region. In particular, plugging these bounds into the characterization of cyclical heavy ball's residual polynomial in Goujaud's theorem we have:
\begin{equation}
r_{2t} \defas \max_{\lambda \in \Lambda}|P_{2t}(\lambda)| \leq \mom^{t} \left( 1
+ {\small\frac{1 - {\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\, 2 t\right)\,.
\end{equation}
Asymptotically, the first exponential term $\mom^{t}$ dominates, and we have the asymptotic rate
\begin{equation}\label{eq:asymptotic_rate}
\limsup_{t \to \infty} \sqrt[2t]{r_{2t}} = \sqrt{\mom}
\end{equation}
</p>
<p>
Surprisingly, this $\sqrt{\mom}$ asymptotic rate is <a href="/blog/2020/momentum/#sec4">the same expression</a> as for the constant step-size heavy ball! It might seem we've done all this work for nothing ...
</p>
<p>
Fortunately, our work has not been in vain.
There <i>is</i> a speedup, but the reason is subtle. While the expression for the convergence rate is the same for the constant and the cyclical step-size variants, the <i>robust region</i> is not. Hence, the minimum momentum that we can plug into this expression is not the same, potentially resulting in a different asymptotic rate.
</p>
<p>
We now seek to find the smallest momentum parameter we can take while staying within the robust region, which is to say while the pre-image $\zeta^{-1}([-1, 1])$ contains the eigenvalue set $\Lambda$.
</p>
<p>
The length of that pre-image can be computed from the formula \eqref{eq:preimage}, and it turns out that this length factors as a product of two terms that are both increasing in $\mom$:
\begin{align}
& \frac{\sqrt{(1+\mom)^2 (\stepone+\stepzero)^2- 4 (1 - \mom)^2 \stepzero \stepone}}{\stepzero \stepone} + (1+\mom)(\frac{1}{\stepone} - \frac{1}{\stepzero}) \\
&= \underbrace{\vphantom{\left[ \sqrt{(\stepone + \stepzero)^2 - 4 \left(\frac{1-\mom}{1+\mom}\right)^2 \stepzero \stepone} + \stepzero - \stepone \right]}\left[\frac{1 + \mom}{\stepzero \stepone} \right]}_{\text{positive increasing}}\underbrace{\left[ \sqrt{(\stepone + \stepzero)^2 - 4 \left(\frac{1-\mom}{1+\mom}\right)^2 \stepzero \stepone} + \stepzero - \stepone \right]}_{\text{positive increasing}}\,.
\end{align}
</p>
<p>
<dt-note>
The second term is increasing since $(\frac{1-\mom}{1+\mom})^2$ is decreasing, and it is non-negative, attaining its minimum value of zero at $\mom=0$. Since a product of two increasing and non-negative functions is increasing, we conclude that the length of the pre-image is an increasing function of $\mom$.
</dt-note>
Since the length of the pre-image is increasing with $\mom$, the smallest value of $\mom$ for which we stay in the robust region is achieved when this pre-image coincides with the eigenvalue support $\Lambda$.<dt-note>Making the region exactly coincide with the eigenvalue support set is possible thanks to the assumption that the two intervals are of the same size.</dt-note> Hence, for the optimal momentum, the inequalities that determine the robust region become equalities, leading to the system
\begin{equation}
\begin{aligned}
\zeta(\muone) = 1\,,\quad
\zeta(\Lone) = 0 \,,\quad\\
\zeta(\mutwo) = 0 \,,\quad
\zeta(\Ltwo) = -1 \,.
\end{aligned}
\end{equation}
</p>
<p>
From the second and third we get the optimal step-size parameters
\begin{equation}
\stepzero = \frac{1+\mom}{\mutwo}~,\quad \stepone = \frac{1+\mom}{\Lone}\,.
\end{equation}
The values of these optimal step-sizes are illustrated in the figure below.
</p>
<figure class="fullwidth">
<img style="background-color: #fffff8;" src="/images/2022/rate_convergence_cyclical.png" alt="">
</figure>
<p>
<span class="marginnote"> <br><br>Convergence rate (in color) for every choice of step-size parameters. The optimal step-sizes are highlighted in orange. <br><br> We can see how the optimal step-sizes are identical for a trivial eigengap $(R=0)$, and grow further apart with $R$. <br><br>
<a href="https://colab.research.google.com/github/google-research/google-research/blob/master/sobolev/examples/cyclical_learning_rates.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</span>
Replacing into the first equation (or fourth, they are redundant)
and moving all momentum terms to one side we have
\begin{align}
\big({\small\dfrac{1 + \mom}{2 \sqrt{\mom}}}\big)^2
& = {\small\dfrac{\Lone \mutwo}{(\Lone - \muone)(\mutwo - \muone)}}\,.
\end{align}
Let $\rho \defas \frac{\Ltwo+\muone}{\Ltwo-\muone} = \frac{1 + \kappa}{1-\kappa}$ be the inverse of the linear convergence rate of the Gradient Descent method.
With this notation, we can rewrite the above equation as
\begin{equation}
\Big({\small\dfrac{1 + \mom}{2 \sqrt{\mom}}}\Big)^2 = \frac{\rho^2 - R^2}{1 - R^2}\,.
\end{equation}
Solving for $\mom$ finally gives
\begin{equation}
\mom = \Big( \frac{\sqrt{\rho^2 - R^2} - \sqrt{\rho^2 - 1}}{\sqrt{1 - R^2}}\Big)^2\,.
\end{equation}
</p>
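<p>
As a quick check that no algebra went astray, we can verify numerically (for a few made-up values of $\rho$ and $R$) that this closed-form momentum indeed solves the previous equation:
</p>

```python
import math

# Sanity check: the closed-form momentum solves
# ((1+m)/(2*sqrt(m)))^2 = (rho^2 - R^2)/(1 - R^2).
# The (rho, R) pairs below are made up for illustration.
for rho, R in [(1.5, 0.3), (10.0, 0.77), (100.0, 0.9)]:
    m = ((math.sqrt(rho**2 - R**2) - math.sqrt(rho**2 - 1))
         / math.sqrt(1 - R**2)) ** 2
    assert 0 < m < 1  # a valid momentum parameter
    lhs = ((1 + m) / (2 * math.sqrt(m))) ** 2
    rhs = (rho**2 - R**2) / (1 - R**2)
    assert abs(lhs - rhs) < 1e-6 * rhs
```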
<p>
With these parameters, we can now write the cyclical heavy ball method with optimal parameters. This method requires knowledge of the four eigenvalue bounds $\muone, \Lone, \mutwo, \Ltwo$:
</p>
<p class="framed">
<b class="tufte-underline">Cyclical Heavy Ball with optimal parameters</b><br>
<b>Input</b>: starting guess $\xx_0$ and eigenvalue bounds $\muone, \Lone, \mutwo, \Ltwo$.<br>
<b>Set</b>: $\rho = \frac{\Ltwo+\muone}{\Ltwo-\muone}$, $R=\frac{\mutwo - \Lone}{\Ltwo - \muone}$, <br>
$\mom = \big( \frac{\sqrt{\rho^2 - R^2} - \sqrt{\rho^2 - 1}}{\sqrt{1 - R^2}}\big)^2$
<br>
$\xx_1 = \xx_0 - \frac{1}{\mutwo} \nabla f(\xx_0)$ <br>
<b>For</b> $t=1, 2, \ldots$ <br>
\begin{align*}\label{eq:momentum_update_optimal}
&\text{${\color{colorstepsize}h_t} = \textstyle\frac{1+\mom}{\mutwo}$ if $t$ is even and ${\color{colorstepsize}h_t} = \textstyle\frac{1+\mom}{\Lone}$ otherwise}\\
&\xx_{t+1} = \xx_t - {\color{colorstepsize}h_t} \nabla
f(\xx_t)+ {\color{colormomentum} m}(\xx_{t} - \xx_{t-1})
\end{align*}
</p>
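<p>
The pseudocode above translates into a few lines of Python. Here is a minimal sketch on a synthetic quadratic whose Hessian spectrum consists of two equal-size clusters; all numerical values are illustrative, and since the minimizer is $x_\star = 0$ the error is directly measurable:
</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic quadratic f(x) = 0.5 * x^T H x with a two-cluster spectrum.
# All numerical values are illustrative; the two intervals have equal size.
mu1, L1, mu2, L2 = 0.01, 0.1, 0.91, 1.0
eigs = np.concatenate([rng.uniform(mu1, L1, 20), rng.uniform(mu2, L2, 20)])
H = np.diag(eigs)
grad = lambda x: H @ x           # the minimizer is x_star = 0

# Optimal parameters from the derivation above.
rho = (L2 + mu1) / (L2 - mu1)
R = (mu2 - L1) / (L2 - mu1)
m = ((np.sqrt(rho**2 - R**2) - np.sqrt(rho**2 - 1)) / np.sqrt(1 - R**2)) ** 2

x0 = rng.standard_normal(40)
x_prev, x = x0, x0 - grad(x0) / mu2      # first step: plain gradient descent
for t in range(1, 300):
    # alternate between the two optimal step-sizes
    h = (1 + m) / mu2 if t % 2 == 0 else (1 + m) / L1
    x, x_prev = x - h * grad(x) + m * (x - x_prev), x

assert np.linalg.norm(x) < 1e-6 * np.linalg.norm(x0)
```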
<h2>Convergence Rates</h2>
<p>
We've seen in the previous section that in the robust region, the convergence rate is a simple function of the momentum parameter. The only thing left to do is then to plug our optimal parameter into this formula \eqref{eq:asymptotic_rate}. The following corollary provides it and an asymptotic version.
</p>
<p class="corollary framed" id="corollary-goujaud" text="Goujaud et al. 2022">
Let $x_t$ be the iterate generated by the above algorithm after an even number of steps $t$ and let $\rho \defas \frac{\Ltwo+\muone}{\Ltwo-\muone} = \frac{1 + \kappa}{1-\kappa}$ be the inverse of the linear convergence rate of the Gradient Descent method. Then we can bound the iterate suboptimality by a product of two terms, the first of which decreases exponentially fast, while the second one increases linearly in $t$:
\begin{equation}
\frac{\|x_t - x_\star\|}{\|x_0 - x_\star\|} \leq \underbrace{\vphantom{\sum_i}\textstyle\Big(\frac{\sqrt{\rho^2 - R^2} - \sqrt{\rho^2 - 1}}{\sqrt{1 - R^2}}\Big)^{t} }_{\text{exponential decrease}} \, \underbrace{\Big( 1 + t\sqrt{\small\dfrac{\rho^2-1}{\rho^2 - R^2}}\Big)}_{\text{linear increase}}\,.
\end{equation}
Asymptotically, the exponential term dominates, and we have the following asymptotic rate factor:
\begin{equation}
\label{eq:cyclic_rate}
\limsup_{t \to \infty} \sqrt[2t]{\frac{\|x_{2t} - x_\star\|}{\|x_0 - x_\star\|}} \leq \frac{\sqrt{\rho^2 - R^2} - \sqrt{\rho^2 - 1}}{\sqrt{1 - R^2}}\,.
\end{equation}
</p>
<p>
How do these compare with those of the constant step-size heavy ball method with optimal parameters (Polyak heavy ball)?
The asymptotic rate of Polyak heavy ball is $r^{\text{PHB}} \defas \rho - \sqrt{\rho^2 - 1}$, and so we'd like to compare
\begin{equation}\label{eq:polyak_rate}
\begin{aligned}
&r^{\text{CHB}} = \frac{\sqrt{\rho^2 - R^2} - \sqrt{\rho^2 - 1}}{\sqrt{1 - R^2}}
\\
\text{vs } &r^{\text{PHB}} \defas \rho - \sqrt{\rho^2 - 1}\,.
\end{aligned}
\end{equation}
</p>
<p>
The first thing we can say is that both coincide when $R=0$, and that $r^{\text{CHB}}$ is a decreasing function of $R$, so the rate improves with the relative gap. Other than that, it's not clear how to summarize the improvement by a simpler formula.
</p>
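<p>
For concreteness, the two rates are easy to compare numerically. The snippet below checks, for an illustrative ill-conditioned value of $\rho$, that they coincide at $R=0$ and that the cyclical rate improves as the gap grows:
</p>

```python
import math

def r_phb(rho):
    """Asymptotic rate of Polyak (constant step-size) heavy ball."""
    return rho - math.sqrt(rho**2 - 1)

def r_chb(rho, R):
    """Asymptotic rate of cyclical heavy ball."""
    return (math.sqrt(rho**2 - R**2) - math.sqrt(rho**2 - 1)) / math.sqrt(1 - R**2)

rho = 1.002   # an illustrative ill-conditioned problem, rho close to 1
assert abs(r_chb(rho, 0.0) - r_phb(rho)) < 1e-12     # identical when R = 0
rates = [r_chb(rho, R) for R in (0.0, 0.3, 0.6, 0.9)]
assert all(a > b for a, b in zip(rates, rates[1:]))  # rate improves with R
```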
<p>
In the ill-conditioned regime, where the inverse condition number goes to zero ($\kappa \defas \tfrac{\mu}{L} \to 0$), which is also the main case of interest, we can provide an approximation that sheds some light. In this regime, we have the following rates (lower is better): <dt-note>The asymptotic rate of gradient descent in this regime is $1 - 2 \kappa + o(\kappa)$.</dt-note>
\begin{align}
&r^{\text{PHB}} \underset{\kappa \rightarrow 0}{=} 1 - 2\sqrt{\kappa} + o(\sqrt{\kappa}) \\
&r^{\text{CHB}} \underset{\kappa \rightarrow 0}{=} 1 - 2\frac{\sqrt{\kappa}}{{\color{red}\sqrt{1-R^2}}} + o(\sqrt{\kappa})\,,\label{eq:approx}
\end{align}
where $o(\sqrt{\kappa})$ contains terms that go to zero faster than $\sqrt{\kappa}$ as $\kappa \rightarrow 0$.
This sheds some light on the improvement: the cyclical schedule reduces the iteration complexity of the method by a factor of $\frac{1}{\color{red}\sqrt{1-R^2}}\,$!
</p>
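<p>
The quality of this expansion can itself be checked numerically. The small sanity check below (using the illustrative MNIST-like value $R=0.77$) confirms that the approximation error is of order $\kappa$, and hence vanishes faster than $\sqrt{\kappa}$:
</p>

```python
import math

def r_chb(rho, R):
    """Asymptotic rate of cyclical heavy ball, from the corollary above."""
    return (math.sqrt(rho**2 - R**2) - math.sqrt(rho**2 - 1)) / math.sqrt(1 - R**2)

R = 0.77  # illustrative relative gap (the MNIST example above)
for kappa in [1e-3, 1e-4, 1e-5]:
    rho = (1 + kappa) / (1 - kappa)
    approx = 1 - 2 * math.sqrt(kappa) / math.sqrt(1 - R**2)
    # the approximation error is O(kappa), i.e. o(sqrt(kappa))
    assert abs(r_chb(rho, R) - approx) < 10 * kappa
```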
<p>
The figure below compares the convergence rate of Polyak heavy ball (<span style="color: #1f77b4">Polyak</span>), the cyclical heavy ball (<span style="color: #ff7f0e">Cyclical</span>) and its approximated rate \eqref{eq:approx} (<span style="color: #2ca02c">Approx</span>, in dashed lines). We see Cyclical behaving like Polyak for well-conditioned problems (large $\kappa$) and approaching the super-accelerated rate (Approx) for ill-conditioned ones (small $\kappa$). The plot also shows that the approximated rate is very accurate for $\kappa \lt 0.001$.
</p>
<figure>
<span class="marginnote"> <br><br>Convergence rate comparison between <span style="color: #1f77b4">Polyak</span> heavy ball, <span style="color: #ff7f0e">Cyclical</span> heavy ball, and its <span style="color: #2ca02c">approximate</span> convergence rate. <br><br>
We can see how the approximated rate provides an excellent fit in the ill-conditioned regime.
<br><br>
<a href="https://colab.research.google.com/github/google-research/google-research/blob/master/sobolev/examples/cyclical_learning_rates.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</span>
<img style="background-color: #fffff8;" src="/images/2022/asymptotic_rate.png" alt="">
</figure>
<h2>Conclusion and Perspectives</h2>
<p>
In this post, we've shown that cyclical learning rates provide an effective and practical way to leverage a gap or bi-modal structure in the Hessian's eigenvalues. We've also derived optimal parameters and shown that in the ill-conditioned regime, this method achieves an extra rate factor improvement of $\frac{1}{\color{red}\sqrt{1-R^2}}$ over the accelerated rate of Polyak heavy ball, where $R$ is the relative gap in the Hessian eigenvalues.
</p>
<p>
The combination of a structured Hessian, momentum and cyclical learning rates is immensely powerful, and here we're only scratching the surface of what's possible.
Many important cases have not yet been studied, such as the non-quadratic setting (strongly convex, convex, and non-convex). In the paper we have empirical results on logistic regression showing a speedup of the cyclical heavy ball method, but no proof beyond local convergence. Equally open is the extension to stochastic algorithms.
</p>
<h2>Citing</h2>
<p>
If you find this blog post useful, please consider citing as
</p>
<blockquote>
<a href="http://fa.bianp.net/blog/2022/cyclical/">Cyclical Step-sizes</a>, Baptiste Goujaud and Fabian Pedregosa, 2022
</blockquote>
<p>
with bibtex entry:
</p>
<pre>
<code>
@misc{goujaud2022cyclical,
title={Cyclical Step-sizes},
author={Goujaud, Baptiste and Pedregosa, Fabian},
howpublished = {\url{http://fa.bianp.net/blog/2022/cyclical/}},
year={2022}
}
</code>
</pre>
<p>
To cite its accompanying full-length paper, please use:
</p>
<blockquote>
<a href="https://arxiv.org/pdf/2106.09687.pdf">
Super-Acceleration with Cyclical Step-sizes</a>, Baptiste Goujaud, Damien Scieur, Aymeric Dieuleveut, Adrien Taylor, Fabian Pedregosa. <i>Proceedings of The 25th International Conference on Artificial Intelligence and Statistics</i>, 2022
</blockquote>
<p>
Bibtex entry:
</p>
<pre>
<code>
@inproceedings{goujaud2022super,
title={Super-Acceleration with Cyclical Step-sizes},
author={Goujaud, Baptiste and Scieur, Damien and Dieuleveut, Aymeric and Taylor,
Adrien and Pedregosa, Fabian},
booktitle = {Proceedings of The 25th International Conference on Artificial
Intelligence and Statistics},
series = {Proceedings of Machine Learning Research},
year={2022}
}
</code>
</pre>
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
Optimization Nuggets: Implicit Bias of Gradient-based Methods2022-01-10T00:00:00+01:002022-01-10T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2022-01-10:/blog/2022/implicit-bias-regression/
<p>
When an optimization problem has multiple global minima, different algorithms can find different solutions, a phenomenon often referred to as the <i>implicit bias</i> of optimization algorithms. In this post we'll characterize the implicit bias of gradient-based methods on a class of regression problems that includes linear least squares and Huber …</p>
<p>
When an optimization problem has multiple global minima, different algorithms can find different solutions, a phenomenon often referred to as the <i>implicit bias</i> of optimization algorithms. In this post we'll characterize the implicit bias of gradient-based methods on a class of regression problems that includes linear least squares and Huber loss regression.
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js", "color.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@article{gunasekar2018characterizing,
title={Characterizing implicit bias in terms of optimization geometry},
author={Gunasekar, Suriya and Lee, Jason and Soudry, Daniel and Srebro, Nathan},
journal={International Conference on Machine Learning},
year={2018},
organization={PMLR},
url={https://arxiv.org/pdf/1802.08246.pdf}
}
@article{soudry2018implicit,
title={The implicit bias of gradient descent on separable data},
author={Soudry, Daniel and Hoffer, Elad and Nacson, Mor Shpigel and Gunasekar, Suriya and Srebro, Nathan},
journal={The Journal of Machine Learning Research},
year={2018},
url={https://arxiv.org/pdf/1710.10345.pdf}
}
@article{telgarsky2013margins,
title={Margins, shrinkage, and boosting},
author={Telgarsky, Matus},
journal={International Conference on Machine Learning},
pages={307--315},
year={2013},
organization={PMLR},
url={https://arxiv.org/pdf/1303.4172.pdf}
}
@article{wilson2017marginal,
title={The marginal value of adaptive gradient methods in machine learning},
author={Wilson, Ashia C and Roelofs, Rebecca and Stern, Mitchell and Srebro, Nathan and Recht, Benjamin},
journal={Advances in Neural Information Processing Systems},
year={2017},
url={https://arxiv.org/pdf/1705.08292.pdf}
}
@article{spingarn1987projection,
title={A projection method for least-squares solutions to overdetermined systems of linear inequalities},
author={Spingarn, Jonathan E},
journal={Linear Algebra and its Applications},
volume={86},
pages={211--236},
year={1987},
publisher={Elsevier}
}
@article{herman1978relaxation,
title={Relaxation methods for image reconstruction},
author={Herman, Gabor T and Lent, Arnold and Lutz, Peter H},
journal={Communications of the ACM},
volume={21},
number={2},
pages={152--158},
year={1978},
publisher={ACM New York, NY, USA},
url={https://sci-hub.yncjkj.com/10.1145/359340.359351}
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
//document.addEventListener('DOMContentLoaded', doTOC, false);
</script>
<div id="TOC"></div>
<div style="display: none;">
$$
\def\aa{\boldsymbol a}
\def\rr{\boldsymbol r}
\def\AA{\boldsymbol A}
\def\HH{\boldsymbol H}
\def\EE{\mathbb E}
\def\II{\boldsymbol I}
\def\CC{\boldsymbol C}
\def\DD{\boldsymbol D}
\def\KK{\boldsymbol K}
\def\eeps{\boldsymbol \varepsilon}
\def\tr{\text{tr}}
\def\LLambda{\boldsymbol \Lambda}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\qq{\boldsymbol q}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\lmax{L}
\def\lmin{\ell}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\QQ{\boldsymbol Q}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\DeclareMathOperator{\span}{{span}}
\def\defas{\stackrel{\text{def}}{=}}
\def\dif{\mathop{}\!\mathrm{d}}
\definecolor{colorspace}{RGB}{77,175,74}
\definecolor{colorspan}{RGB}{228,26,28}
\definecolor{colorcomplement}{RGB}{55,126,184}
\def\spanA{{\color{colorspan}\mathcal{A}}}
\def\spanAC{{\color{colorcomplement}\mathcal{A}^{\perp}}}
\def\ProjAC{{\color{colorcomplement}P_{\mathcal{A}^{\perp}}}}
$$
</div>
<p>
Consider the optimization problem where the objective function $f$ is a generalized linear model with data matrix $A \in \RR^{n \times p}$ and target vector $b \in \RR^n$. Denoting by $A_1, \ldots, A_n$ the row vectors of the data matrix, we can write this problem as
\begin{equation}\label{eq:opt}
\argmin_{x \in \RR^p} \left\{f(x) \defas \sum_{i=1}^n \varphi (A_i^\top x, b_i) \right\}\,,
\end{equation}
where $\varphi(z, y)$ is a differentiable real-valued function verifying the <q>unique finite root condition</q>: it is minimized only at $z=y$. Losses of this kind are typically used for regression and include the least-squares loss $\varphi(z, y) = (z - y)^2$ and the <a href="https://en.wikipedia.org/wiki/Huber_loss">Huber loss</a>. We'll further assume that the linear system $A x = b$ is <i>under-specified</i>, that is, that it admits more than one solution. This assumption holds, for example – but is not limited to – for over-parametrized models ($p > n$) with a full-rank data matrix.
</p>
<p>
Problems of this form verify two key properties that make it easy to characterize the bias of gradient-based methods. By gradient-based methods I mean any method in which the updates are given by a linear combination of current and past gradients. This includes gradient descent, gradient descent with momentum, stochastic gradient descent (SGD), SGD with momentum, and Nesterov's accelerated gradient method. It does not, however, include quasi-Newton methods or diagonally preconditioned methods such as Adagrad or Adam.
</p>
<p><span class="tufte-underline">Property 1: Iterates remain in the span of the data</span>. The gradient of the $i$-th sample $\varphi (A_i^\top x, b_i)$ has the same direction as its data sample $A_i$. If we denote by $\varphi'$ the derivative of $\varphi$ with respect to its first argument, then we have
\begin{equation}
\nabla_x \left[\varphi (A_i^\top x, b_i)\right] = \underbrace{\vphantom{\varphi_i'A_i}A_i}_{\text{vector}} \underbrace{\varphi'(A_i^\top x, b_i)}_{\text{scalar}} \,.
\end{equation}
This implies that any gradient-based method generates updates that stay in the span of the vectors $\{A_1, \ldots, A_n\}$.
</p>
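<p>Property 1 is easy to check numerically. The snippet below is a sketch of mine (not part of the original post; the dimensions and the least-squares choice of $\varphi$ are arbitrary): it verifies that the gradient is left unchanged by the orthogonal projection onto the row space of $A$, i.e. that it lies in the span of the data.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 5
A = rng.standard_normal((n, p))    # rows A_1, ..., A_n
b = rng.standard_normal(n)

# gradient of the least-squares instance f(x) = sum_i (A_i^T x - b_i)^2
x = rng.standard_normal(p)
grad = 2 * A.T @ (A @ x - b)

# A^+ A is the orthogonal projection onto span{A_1, ..., A_n};
# the gradient is unchanged by it, so it lies in that span
P = np.linalg.pinv(A) @ A
assert np.allclose(P @ grad, grad)
```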
<p>
It's no surprise then that the vector space generated by the samples $A_1, \ldots, A_n$ plays a crucial role here. For convenience we'll denote this subspace by
\begin{equation}
\spanA \defas \textcolor{colorspan}{\span\{A_1, \ldots, A_n\}} \,.
\end{equation}
Another subspace that will play a crucial role is the orthogonal complement of $\spanA$, which we denote $\spanAC$. <dt-note>We'll denote in <span style="color: #e41a1c">red</span> (<span style="color: #377eb8">blue</span>) both the vector space $\spanA$ (its orthogonal complement $\spanAC$) and its elements.</dt-note>
</p>
<p><span class="tufte-underline">Property 2: minimizers solve the linear system $A x = b$.</span> By the unique root condition of $\varphi$, the global minimum of \eqref{eq:opt} is achieved when $A_i^\top x = b_i$ for all $i$. In other words, the global minimizers are the solutions to the linear system $A x = b$, a set that is non-empty by the under-specification assumption.
</p>
<p>
With these ingredients we can characterize the implicit bias in the following Theorem:
</p>
<p class="theorem framed" text="Gradient methods solve a minimal distance problem">
Gradient-based methods started from $x_0$ converge to the solution with smallest distance to $x_0$. More precisely, assume that the iterates of a gradient-based method converge to a solution of \eqref{eq:opt}, and let $x_\infty \defas \lim_{t \to \infty} x_{t} $ denote this limit. Then $x_{\infty}$ can be expressed as
\begin{equation}
x_{\infty} = \textcolor{colorspan}{A^\dagger b} + \textcolor{colorcomplement}{(I - P)x_0}\,,
\end{equation}
where $^\dagger$ denotes the matrix pseudoinverse and $\textcolor{colorcomplement}{(I - P)}$ is the orthogonal projection onto the orthogonal complement of $\textcolor{colorspan}{\mathcal{A}}$.
Furthermore, this is the solution with smallest distance to $x_0$, that is,
\begin{equation}\label{eq:minimal_traveling}
x_{\infty} = \argmin_{x \in \RR^p} ||x - x_0||_2 ~\text{ subject to } A x = b \,.
\end{equation}
</p>
<div class="proof">
<p>
I've split the proof into two parts. The first part characterizes the limit iterate $x_{\infty}$, while the second part shows that the limit iterate solves a minimal distance problem.
</p>
<p>
<span class="tufte-underline">Characterizing the limit iterate.</span> The main argument here is to show that the limit iterate belongs to the intersection of two affine spaces and then compute their intersection.
By Property 2, the limit iterate must solve the linear system $A x_{\infty} = b$.
A classical <a href="https://en.wikipedia.org/wiki/System_of_linear_equations#Relation_to_nonhomogeneous_systems">linear algebra result</a> states that all solutions of this problem are of the form $x + {\textcolor{colorcomplement}c}$, with $x$ any solution of $A x = b$ and $\textcolor{colorcomplement}{c \in \spanAC}$. Let's take $x = \textcolor{colorspan}{A^\dagger b}$ – where $^\dagger$ denotes the matrix pseudoinverse – so that we have
\begin{equation}
x_{\infty} = \textcolor{colorspan}{A^\dagger b} + {\textcolor{colorcomplement}c} \,,\quad \text{ for some }\textcolor{colorcomplement}{c \in \spanAC}\,.
\end{equation}
Let $P$ denote the orthogonal projection onto $\spanA$. Then we can decompose the initialization as $x_0 = \textcolor{colorspan}{P x_0} + \textcolor{colorcomplement}{(I - P)x_0}$. By the first property all updates are in $\spanA$, so the limit iterate can be written as
\begin{equation}
x_{\infty} = \textcolor{colorcomplement}{(I - P)x_0} + \textcolor{colorspan}{a} \,, \text{ for some }\textcolor{colorspan}{a \in \spanA}\,.
\end{equation}
Combining the previous two equations, we have that ${\textcolor{colorcomplement}c} = \textcolor{colorcomplement}{(I - P)x_0}$ and $\textcolor{colorspan}{a} = \textcolor{colorspan}{A^\dagger b}$. Hence we have arrived at the characterization
\begin{equation}\label{eq:limit_characterization}
x_{\infty} = \textcolor{colorspan}{A^\dagger b} + \textcolor{colorcomplement}{(I - P)x_0}\,.
\end{equation}
</p>
<p>
<span class="tufte-underline">The limit iterate has minimal distance.</span> Let $x_{\star}$ denote the solution to \eqref{eq:minimal_traveling}, which is unique by the strong convexity of this problem. We aim to show that $x_{\star} = x_{\infty}$. Since $x_\star$ is also a solution to the linear system $A x_\star = b$, we must have $x_\star - x_{\infty} \in \spanAC$ and so
\begin{align}
\|x_{\star} - x_0\| &= \|\textcolor{colorcomplement}{x_{\star} - x_{\infty}} + x_{\infty}- x_0\|\\
&= \|\textcolor{colorcomplement}{x_{\star} - x_{\infty}} + \textcolor{colorspan}{A^\dagger b} - \textcolor{colorspan}{P x_0} \|\\
&= \sqrt{ \|\textcolor{colorcomplement}{x_{\star} - x_{\infty}} \|^2 + \| \textcolor{colorspan}{A^\dagger b} - \textcolor{colorspan}{P x_0}\|^2 }\,,
\end{align}
where the last identity follows by orthogonality. Since $x_\star$ minimizes the distance $\|x_{\star} - x_0\|$ , we must have $\textcolor{colorcomplement}{x_{\star} - x_{\infty}} = 0$, and so $x_\infty = x_\star$.
</p>
</div>
<p>
An immediate consequence of this Theorem is that when the initialization $x_0$ is in $\spanA$, its projection onto $\spanAC$ is zero, and Eq. \eqref{eq:limit_characterization} reduces to $\textcolor{colorspan}{x_{\infty}} = \textcolor{colorspan}{A^\dagger b}$. This is the minimal norm solution, so we have
</p>
<p class="corollary framed">
Under the same conditions as the theorem, if $x_0 \in \spanA$, then the limit iterate $x_{\infty}$ solves the minimal norm problem
\begin{equation}\label{eq:minnorm}
x_{\infty} = \argmin_{x \in \RR^p} ||x||_2 ~\text{ subject to } A x = b \,.
\end{equation}
</p>
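<p>As a quick numerical sanity check of the theorem (my own sketch, not part of the original post; the matrix sizes, step-size and iteration count are arbitrary choices), we can run plain gradient descent on an under-specified least-squares problem and compare its limit with the closed-form characterization:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 5                         # under-specified system: p > n
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

# plain gradient descent on f(x) = ||Ax - b||^2 from a random x0
x0 = rng.standard_normal(p)
L = 2 * np.linalg.norm(A, 2) ** 2   # smoothness constant of f
x = x0.copy()
for _ in range(100_000):
    x -= (1 / L) * 2 * A.T @ (A @ x - b)

# theorem: x_infty = A^+ b + (I - P) x0, with P the projection onto span{A_i}
Apinv = np.linalg.pinv(A)
P = Apinv @ A
assert np.allclose(x, Apinv @ b + (np.eye(p) - P) @ x0, atol=1e-6)

# ... which is also the solution of Ax = b closest to x0
x_mindist = x0 + Apinv @ (b - A @ x0)
assert np.allclose(x, x_mindist, atol=1e-6)
```

Swapping in momentum or SGD updates should leave both assertions unchanged, since those methods are also covered by the theorem.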
<h2>About the proof</h2>
<p>
I haven't been able to track down the origins of this proof. The first time I saw this argument was in Wilson's 2017 paper,<dt-cite key="wilson2017marginal"></dt-cite> although the authors do not claim novelty for this result. A similar argument is also sketched in section 2.1 of Gunasekar et al. 2018.<dt-cite key="gunasekar2018characterizing"></dt-cite>
</p>
<p>
<b>Update 2022/02/17.</b> Thanks to everyone who commented on <a href="https://twitter.com/fpedregosa/status/1480891377835823106">twitter</a>, we could track down the origins of this result! It seems one of the earliest references is a 1978 paper by Herman et al.,<dt-cite key="herman1978relaxation"></dt-cite> who derive a similar result for the Kaczmarz method.
</p>
<p>
<b>Acknowledgements.</b> Thanks to <a href="https://vene.ro/">Vlad Niculae</a>, <a href="https://scholar.google.com/citations?user=93PAG2AAAAAJ&hl=en">Baptiste Goujaud</a>, <a href="https://www.paulvicol.com/">Paul Vicol</a> and <a href="https://sites.google.com/site/lechaoxiao/">Lechao Xiao</a> for feedback on this post and reporting typos, and everyone that commented on the <a href="https://twitter.com/fpedregosa/status/1480891377835823106">twitter thread</a> with insights and related references. Check out also <a href="https://twitter.com/mathusmassias/status/1480991740727341061">Mathurin Massias' beautiful take on this</a> using duality.
</p>
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
Optimization Nuggets: Exponential Convergence of SGD2021-12-15T00:00:00+01:002021-12-15T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2021-12-15:/blog/2021/exponential-sgd/
<p>
This is the first of a series of blog posts on short and beautiful proofs in optimization (let me know what you think in the comments!). For this first post in the series I'll show that stochastic gradient descent (SGD) converges exponentially fast to a neighborhood of the solution.
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js", "color.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@article{leblond2018improved,
title={Improved asynchronous parallel optimization analysis for stochastic incremental methods},
author={Leblond, Rémi and Pedregosa, Fabian and Lacoste-Julien, Simon},
journal={Journal of Machine Learning Research},
url={https://www.jmlr.org/papers/volume19/17-650/17-650.pdf},
year={2018}
}
@article{gower2019sgd,
title={SGD: General analysis and improved rates},
author={Gower, Robert Mansel and Loizou, Nicolas and Qian, Xun and Sailanbayev, Alibek and Shulgin, Egor and Richtárik, Peter},
journal={International Conference on Machine Learning},
year={2019},
url={https://arxiv.org/pdf/1901.09401.pdf},
organization={PMLR}
}
@article{mokhtari2020stochastic,
title={Stochastic conditional gradient methods: From convex minimization to submodular maximization},
author={Mokhtari, Aryan and Hassani, Hamed and Karbasi, Amin},
journal={Journal of machine learning research},
year={2020},
url={https://arxiv.org/pdf/1804.09554.pdf},
}
@article{schmidt2013fast,
title={Fast convergence of stochastic gradient descent under a strong growth condition},
author={Schmidt, Mark and Le Roux, Nicolas},
journal={arXiv preprint arXiv:1308.6370},
url={https://arxiv.org/pdf/1308.6370.pdf},
year={2013}
}
@article{nesterov1998introductory,
title={Introductory lectures on convex programming volume I: Basic course},
author={Nesterov, Yurii},
journal={Lecture notes},
url={https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.693.855&rep=rep1&type=pdf},
year={1998}
}
@article{khaled2020better,
title={Better theory for SGD in the nonconvex world},
author={Khaled, Ahmed and Richtárik, Peter},
journal={arXiv preprint arXiv:2002.03329},
url={https://arxiv.org/pdf/2002.03329.pdf},
year={2020}
}
@book{bauschke2011convex,
title={Convex analysis and monotone operator theory in Hilbert spaces},
author={Bauschke, Heinz H and Combettes, Patrick L and others},
volume={408},
year={2011},
journal={Springer},
url={https://doi.org/10.1007/978-3-319-48311-5},
}
@inproceedings{loizou2021stochastic,
title={Stochastic polyak step-size for sgd: An adaptive learning rate for fast convergence},
author={Loizou, Nicolas and Vaswani, Sharan and Laradji, Issam Hadj and Lacoste-Julien, Simon},
journal={International Conference on Artificial Intelligence and Statistics},
year={2021},
organization={PMLR},
url={https://arxiv.org/pdf/2002.10542.pdf}
}
@article{vaswani2019painless,
title={Painless stochastic gradient: Interpolation, line-search, and convergence rates},
author={Vaswani, Sharan and Mishkin, Aaron and Laradji, Issam and Schmidt, Mark and Gidel, Gauthier and Lacoste-Julien, Simon},
journal={Advances in neural information processing systems},
volume={32},
pages={3732--3745},
url={https://arxiv.org/pdf/1905.09997.pdf},
year={2019}
}
@article{cevher2019linear,
title={On the linear convergence of the stochastic gradient method with constant step-size},
author={Cevher, Volkan and Vũ, Bằng Công},
journal={Optimization Letters},
volume={13},
number={5},
pages={1177--1187},
year={2019},
url={https://arxiv.org/pdf/1712.01906.pdf},
publisher={Springer}
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
// document.addEventListener('DOMContentLoaded', doTOC, false);
</script>
<div id="TOC"></div>
<div style="display: none">
$$
\def\aa{\boldsymbol a}
\def\rr{\boldsymbol r}
\def\AA{\boldsymbol A}
\def\HH{\boldsymbol H}
\def\EE{\mathbb E}
\def\II{\boldsymbol I}
\def\CC{\boldsymbol C}
\def\DD{\boldsymbol D}
\def\KK{\boldsymbol K}
\def\eeps{\boldsymbol \varepsilon}
\def\tr{\text{tr}}
\def\LLambda{\boldsymbol \Lambda}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\qq{\boldsymbol q}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\lmax{L}
\def\lmin{\ell}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\QQ{\boldsymbol Q}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\DeclareMathOperator{\span}{\mathbf{span}}
\def\defas{\stackrel{\text{def}}{=}}
\def\dif{\mathop{}\!\mathrm{d}}
\definecolor{colorcvx}{RGB}{27, 158, 119}
\definecolor{colorcondition}{RGB}{31,120,180}
\definecolor{colorsmooth}{RGB}{217,95,2}
\definecolor{colornoise}{RGB}{117,112,179}
\definecolor{colorstep}{RGB}{231,41,138}
\definecolor{colorstep2}{RGB}{227,26,28}
\def\cvx{{\color{colorcvx}\boldsymbol{\mu}}}
\def\smooth{{\color{colorsmooth}\boldsymbol{L}}}
\def\noise{{\color{colornoise}\boldsymbol{\sigma}}}
\def\stepsize{{\color{colorstep}\boldsymbol{\gamma}}}
\def\harmonic{{\color{colorstep2}\boldsymbol{h}}}
\def\condition{{\color{colorcondition}\boldsymbol{\kappa}}}
\def\Econd{\mathrm{E}}
$$
</div>
<p>
<b>The problem</b> is to find a minimizer $x_{\star}$ of a function $f$, which can be expressed as the expectation of another function $F$:
\begin{equation}
x_{\star} \in \argmin_{x \in \RR^p} \left\{ f(x) \defas \EE_{\xi}F(x; \xi) = \int_{\mathcal{S}} F(x; \xi)\dif P(\xi)\right\}\,,
\end{equation}
with access to $\nabla F(x; \xi)$, which is the gradient of $F$ with respect to the first argument. Here, $\mathcal{S}$ denotes the sample space, and $\xi \sim P$ is an $\mathcal{S}$-valued random variable.
</p>
<p>
SGD is one of the most popular methods to solve this problem. At each iteration, it samples $\xi_t \sim P$ and generates the next iterate $x_{t+1}$ according to the recursion
\begin{equation}
x_{t+1} = x_t - \gamma_t \nabla F(x_t; \xi_t) \,,
\end{equation}
where $\gamma_t > 0$ is a step-size parameter.
While the most classical setup for SGD is to consider the step-size parameter $\gamma_t$ to be decreasing as a function of the iteration counter $t$, here we'll consider instead the <i>constant</i> step-size $\gamma_t = \gamma$.
</p>
<p>
My goal in this post is to prove, under standard assumptions, that SGD converges exponentially fast<dt-note>Exponential convergence is often referred to as linear convergence in the optimization literature.</dt-note> to a neighborhood of the solution. I won't try to be as generic or tight as possible. There has in fact been a lot of work in recent years on devising weaker assumptions, which I won't cover here. Those interested should see the following papers and references therein.<dt-cite key="cevher2019linear"></dt-cite> <dt-cite key="gower2019sgd"></dt-cite> <dt-cite key="khaled2020better"></dt-cite>
</p>
<p>
<b>Assumptions</b>.
We make 3 assumptions that quantify the 3 constants that appear in the rate: <span style="color: rgb(217,95,2)"><b>smoothness</b></span>, <span style="color: rgb(27, 158, 119)"><b>curvature</b></span> and <span style="color: rgb(117,112,179)"><b>noise</b></span>.
</p>
<p>
The first two assumptions are the classical $\smooth$-smoothness and $\cvx$-strong convexity properties, pervasive in optimization. This is the class of functions where the scalar product $\langle\nabla F(x; \xi) - \nabla F(y; \xi), x - y\rangle$ is lower and upper bounded by a quadratic:
\begin{equation}
\cvx \|x - y\|^2 \leq \langle \nabla F(x; \xi) - \nabla F(y; \xi), x - y\rangle \leq \smooth \|x - y\|^2\,.
\end{equation}
</p>
<p>
These two inequalities can be combined into the following one<dt-note>A proof of this inequality can be found for instance in Theorem 2.1.11 of Nesterov's 1998 book.</dt-note> <dt-cite key="nesterov1998introductory"></dt-cite> which is the one we'll actually use in the proof
\begin{equation}
\begin{aligned}
&\langle \nabla F(x; \xi) - \nabla F(y; \xi), x - y\rangle \geq \\
&\qquad\tfrac{\cvx\smooth}{\cvx + \smooth}\|x - y\|^2 + \tfrac{1}{\cvx + \smooth}\|\nabla F(x; \xi) - \nabla F(y; \xi)\|^2\,.
\end{aligned}\label{eq:smooth_cvx}
\end{equation}
</p>
<p>
<b style="color: rgb(117,112,179)">The third assumption</b> is that the variance of the gradient at optimum is bounded. That is, there exists a finite scalar $\noise$ such that
\begin{equation}
\EE_{\xi} \|\nabla F(x_{\star}; \xi)\|^2 \leq \noise^2\,.
\end{equation}
</p>
<p>
With these ingredients, we can show that the expected error of SGD can be bounded by a sum of two terms, one of which converges exponentially fast in the iteration number, while the second term is independent of the iteration counter.
</p>
<p class="framed" id="main-theorem"><b>Theorem</b>. Let $\stepsize \leq \frac{1}{\cvx + \smooth}$ and let $\harmonic \defas \frac{2 \cvx \smooth}{\cvx + \smooth}$ denote the harmonic mean of $\cvx$ and $\smooth$. Then the expected iterate suboptimality converges exponentially fast to a $2\frac{\stepsize}{\harmonic}\noise^2$-radius ball centered at the solution. More precisely, at iteration $t$ we have the following bound:
\begin{align}
\EE\, \|x_{t} - x_\star\|^2 \leq \underbrace{\vphantom{\sum_i}\left(1 - \stepsize \harmonic \right)^{t}\,\|x_0 - x_\star\|^2}_{\text{exponential decrease}} + \underbrace{\vphantom{\sum_i}2\frac{\stepsize}{\harmonic}\noise^2}_{\text{stationary}}\,.
\end{align}
</p>
<div class="proof">
<p>
Using the definition of $x_{t+1}$ and expanding the square we have
\begin{align}
\|x_{t+1} - x_\star\|^2 &= \|x_t - x_\star\|^2 - 2 \stepsize \langle \nabla F(x_t; \xi_t) - \nabla F(x_\star; \xi_t) + \nabla F(x_\star; \xi_t), x_t - x_\star\rangle \nonumber\\
&\qquad + \stepsize^2 \|\nabla F(x_t; \xi_t) -\nabla F(x_\star; \xi_t) + \nabla F(x_\star; \xi_t)\|^2 \\
&\leq \left( 1 - \stepsize \harmonic \right)\|x_t - x_\star\|^2 + 2\stepsize\left[\stepsize - \tfrac{1}{\cvx + \smooth}\right]\|\nabla F(x_t; \xi_t) - \nabla F(x_\star; \xi_t)\|^2 \nonumber\\
& \qquad - 2 \stepsize \langle \nabla F(x_\star; \xi_t), x_t - x_\star\rangle + 2 \stepsize^2 \| \nabla F(x_\star; \xi_t)\|^2\,,
\end{align}
where in the first line we have added and subtracted $\nabla F(x_\star; \xi_t)$, and in the inequality we have used Eq. \eqref{eq:smooth_cvx} together with the bound $\|a + b\|^2 \leq 2 \|a\|^2 + 2 \|b\|^2$ on the last term.<dt-note>Alternatively, one could use the tighter bound $\|a + b\|^2 \leq (1+ \lambda) \|a\|^2 + (1 + \lambda^{-1}) \|b\|^2$. This allows taking a step-size of up to $\frac{2}{\cvx + \smooth}$, a factor of 2 improvement over the current step-size. The price to pay is a larger stationary term. However, this tighter proof technique seems necessary to recover the optimal step-size $\frac{2}{\cvx + \smooth}$ in the interpolation regime.</dt-note>
</p>
<p>
Taking conditional expectations of the previous expression, the last term can be bounded by $2 \stepsize^2 \noise^2$, while the inner-product term vanishes since $\EE_{\xi_t} \nabla F(x_\star; \xi_t) = \nabla f(x_\star) = 0$ by optimality of $x_\star$. Furthermore, the condition $\stepsize \leq \frac{1}{\cvx + \smooth}$ ensures the quantity inside the square brackets is non-positive, so we can drop that term. All in all, we have
\begin{equation}
\EE_{\xi_t | x_t} \|x_{t+1} - x_\star\|^2 \leq \left( 1 - \stepsize \harmonic \right)\|x_t - x_\star\|^2 + 2 \stepsize^2 \noise^2\,.
\end{equation}
</p>
<p>
We now take full expectations and unroll the previous inequality down to $t=0$, which allows us to bound the expected suboptimality by the sum of an exponentially decreasing term and a geometric series
\begin{align}
\EE \|x_{t+1} - x_\star\|^2 \leq \left( 1 - \stepsize \harmonic \right)^{t+1}\|x_0 - x_\star\|^2 + 2 \stepsize^2 \noise^2 \sum_{i=0}^t \left( 1 - \stepsize \harmonic \right)^i\,.
\end{align}
The condition $\stepsize \leq \frac{1}{\cvx + \smooth} \leq \frac{1}{\harmonic}$ also ensures that $0 \leq 1 - \stepsize \harmonic \leq 1$, so the geometric series in the last term is monotonically increasing in $t$ with a finite limit. The last step is to bound this geometric series by its limit:
\begin{equation}
\sum_{i=0}^t \left( 1 - \stepsize \harmonic \right)^i \leq \sum_{i=0}^\infty \left( 1 - \stepsize \harmonic \right)^i = \frac{1}{\stepsize \harmonic}\,.
\end{equation}
Plugging this last inequality into the previous one gives the desired bound.
</p>
</div>
<p>
<b>Implications for interpolating models.</b>
For models in which $\nabla F(x_{\star}; \xi) = 0$, we have that the noise parameter $\noise$ is zero and so SGD converges exponentially fast to the global optimum. The condition $\nabla F(x_{\star}; \xi) = 0$ is often referred to as the interpolation condition (intuitively, this condition states that all partial losses $F(\cdot, \xi)$ admit the same minimizer) and has been recently leveraged to design backtracking and adaptive step-size schedulers.<dt-cite key="vaswani2019painless"></dt-cite> <dt-cite key="loizou2021stochastic"></dt-cite>
</p>
<p>
In this case, the stationary term in the above theorem vanishes, implying global exponential convergence. As far as I know, this was first noted by Schmidt and Le Roux.<dt-cite key="schmidt2013fast"></dt-cite>
</p>
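<p>To see the interpolation regime in action, here is a small simulation of mine (not from the original post; the sizes, seed and step-size are arbitrary choices): SGD on a consistent least-squares problem, where $b = A x_\star$ so that $\noise = 0$, drives the iterates all the way to $x_\star$ instead of stalling at a noise ball.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
A = rng.standard_normal((n, p))
x_star = rng.standard_normal(p)
b = A @ x_star                    # consistent system: interpolation holds

# per-sample loss F(x; i) = (A_i^T x - b_i)^2 has zero gradient at x_star,
# so the noise constant sigma of the theorem is zero
gamma = 0.01                      # constant step-size
x = np.zeros(p)
for _ in range(20_000):
    i = rng.integers(n)
    x -= gamma * 2 * (A[i] @ x - b[i]) * A[i]   # SGD step

# no stationary term: SGD converges to the exact solution
assert np.linalg.norm(x - x_star) < 1e-6
```

Replacing the consistent $b$ with a noisy one (e.g. $b = A x_\star + \varepsilon$) should make the error plateau at a level growing with $\stepsize \noise^2$, as the theorem predicts.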
<p>
<b>Acknowledgements.</b> Thanks to <a href="https://konstmish.github.io/">Konstantin Mishchenko</a> for spotting numerous typos and pointing out I was wrongly using the term "over-parametrized models" (now corrected), <a href="https://scholar.google.com/citations?user=93PAG2AAAAAJ&hl=en">Baptiste Goujaud</a> for reporting typos and making very useful suggestions,
<a href="https://gowerrobert.github.io/">Robert Gower</a> and <a href="https://scholar.google.ca/citations?user=XE9SDzgAAAAJ&hl=en">Bart van Merriënboer</a> for feedback on the post.
</p>
<br><br><br>
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
On the Link Between Optimization and Polynomials, Part 42021-04-13T00:00:00+02:002021-04-13T00:00:00+02:00Fabian Pedregosatag:fa.bianp.net,2021-04-13:/blog/2021/no-momentum/
<p>
While the most common accelerated methods like Polyak and Nesterov incorporate a momentum term, a little known fact is that simple gradient descent –no momentum– can achieve the same rate
through only a well-chosen sequence of step-sizes. In this post we'll derive this method and through simulations discuss its practical limitations.
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js", "color.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@article{fletcher2005barzilai,
title={On the Barzilai-Borwein method},
author={Fletcher, Roger},
journal={Optimization and control with applications},
year={2005},
publisher={Springer}
}
@article{dai2005asymptotic,
title={On the asymptotic behaviour of some new gradient methods},
author={Dai, Yu-Hong and Fletcher, Roger},
journal={Mathematical Programming},
year={2005},
publisher={Springer},
url={https://link.springer.com/content/pdf/10.1007%2Fs10107-004-0516-9.pdf},
}
@article{Rutishauser1959,
author="Rutishauser, H.",
title="Theory of Gradient Methods",
journal="Refined Iterative Methods for Computation of the Solution and the Eigenvalues of Self-Adjoint Boundary Value Problems",
year="1959",
url="https://doi.org/10.1007/978-3-0348-7224-9_2"
}
@article{hardt2013zen,
author="Moritz Hardt",
title="The zen of gradient descent",
journal="Moody Rd (blog)",
year="2013",
url="http://blog.mrtz.org/2013/09/07/the-zen-of-gradient-descent.html"
}
@book{golub2012matrix,
title={Matrix computations},
author={Golub, Gene H and Van Loan, Charles F},
year={2013},
journal={Johns Hopkins University Press}
}
@article{yuan2008step,
title={Step-sizes for the gradient method},
author={Yuan, Ya-xiang},
journal={AMS IP Studies in Advanced Mathematics},
volume={42},
number={2},
pages={785},
year={2008},
publisher={Providence, RI; American Mathematical Society; 1999},
url={ftp://lsec.cc.ac.cn/pub/yyx/papers/p0504.pdf}
}
@article{flanders1950numerical,
title={Numerical determination of fundamental modes},
author={Flanders, Donald A and Shortley, George},
journal={Journal of Applied Physics},
year={1950},
url={https://doi.org/10.1063/1.1699598},
openaccess={https://sci-hub.tw/10.1063/1.1699598}
}
@article{nemirovski1995information,
title={Information-based complexity of convex programming},
author={Nemirovski, Arkadi},
journal={Lecture Notes (Lecture 12)},
year={1995},
url={https://www2.isye.gatech.edu/~nemirovs/Lec_EMCO.pdf}
}
@article{bauer2011my,
title={My years with Rutishauser},
author={Bauer, Friedrich L},
journal={Informatik-Spektrum},
year={2011},
publisher={Springer},
url={https://sci-hub.tw/10.1007/s00287-011-0554-7}
}
@article{gutknecht1910numerical,
title={Numerical analysis in Zurich--50 years ago},
author={Gutknecht, Martin H},
journal={Schweizerische Mathematische Gesellschaft},
year={2010},
url={http://www.sam.math.ethz.ch/~mhg/pub/mhg-published/notes8-Gut10-NAZ-SMG100.pdf}
}
@article{cauchy1847methode,
title={Méthode générale pour la résolution des systèmes d'équations simultanées},
author={Cauchy, Augustin},
journal={Comp. Rend. Sci. Paris},
volume={25},
number={1847},
pages={536--538},
year={1847},
url={https://gallica.bnf.fr/ark:/12148/bpt6k90190w/f406}
}
@article{lemarechal2012cauchy,
title={Cauchy and the gradient method},
author={Lemaréchal, Claude},
journal={Doc Math Extra},
year={2012},
url={http://emis.ams.org/journals/DMJDMV/vol-ismp/40_lemarechal-claude.pdf}
}
@incollection{scieur2016regularized,
title = {Regularized Nonlinear Acceleration},
author = {Scieur, Damien and d'Aspremont, Alexandre and Bach, Francis},
journal = {Advances in Neural Information Processing Systems 29},
year = {2016},
url = {https://arxiv.org/pdf/1606.04133.pdf}
}
@book{fischer1996polynomial,
title={Polynomial based iteration methods for symmetric linear systems},
author={Fischer, Bernd},
year={1996},
url={https://doi.org/10.1007/978-3-663-11108-5},
journal={Springer}
}
@book{saad2003iterative,
title={Iterative methods for sparse linear systems},
author={Saad, Yousef},
volume={82},
year={2003},
journal={SIAM},
url={https://www-users.cs.umn.edu/~saad/IterMethBook_2ndEd.pdf}
}
@article{lanczos1950iteration,
title={An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators},
author={Lanczos, Cornelius},
journal={Journal of Research of the National Bureau of Standards},
year={1950},
url={http://www.cs.umd.edu/~oleary/lanczos1950.pdf}
}
@article{hestenes1952methods,
title={Methods of conjugate gradients for solving linear systems},
author={Hestenes, Magnus R and Stiefel, Eduard},
journal={Journal of research of the National Bureau of Standards},
year={1952},
url={https://pdfs.semanticscholar.org/466d/addfb6340c28cb8da548007028c8cc5df687.pdf}
}
@article{pedregosa2020average,
title={Average-case Acceleration Through Spectral Density Estimation},
author={Pedregosa, Fabian and Scieur, Damien},
journal={arXiv preprint arXiv:2002.04756},
year={2020},
url={https://arxiv.org/pdf/2002.04756.pdf}
}
@article{scieur2020universal,
title={Universal Average-Case Optimality of Polyak Momentum},
author={Scieur, Damien and Pedregosa, Fabian},
journal={arXiv preprint arXiv:2002.04664},
year={2020},
url={https://arxiv.org/pdf/2002.04664.pdf}
}
@article{bach2019polynomial,
title={Polynomial magic I: Chebyshev polynomials},
author={Bach, Francis},
url={https://francisbach.com/chebyshev-polynomials/},
journal={Blog post},
year={2019}
}
@article{young1953richardson,
title={On Richardson's method for solving linear systems with positive definite matrices},
author={Young, David},
journal={Journal of Mathematics and Physics},
year={1953},
url={https://doi.org/10.1002/sapm1953321243},
publisher={Wiley Online Library},
openaccess={https://sci-hub.se/10.1002/sapm1953321243}
}
@article{loshchilov2016sgdr,
title={SGDR: Stochastic gradient descent with warm restarts},
author={Loshchilov, Ilya and Hutter, Frank},
journal={International Conference on Learning Representations},
year={2017},
url={https://arxiv.org/pdf/1608.03983.pdf}
}
@article{wadayama2020chebyshev,
title={Chebyshev inertial iteration for accelerating fixed-point iterations},
author={Wadayama, Tadashi and Takabe, Satoshi},
journal={arXiv preprint arXiv:2001.03280},
year={2020},
url={https://arxiv.org/pdf/2001.03280.pdf}
}
@inproceedings{roulet2017sharpness,
title={Sharpness, restart and acceleration},
author={Roulet, Vincent and d'Aspremont, Alexandre},
booktitle={Advances in Neural Information Processing Systems},
pages={1119--1129},
year={2017},
url={http://papers.nips.cc/paper/6712-sharpness-restart-and-acceleration.pdf}
}
@article{agarwal2021acceleration,
title={Acceleration via Fractal Learning Rate Schedules},
author={Agarwal, Naman and Goel, Surbhi and Zhang, Cyril},
journal={Proceedings of the 38 th International Conference on Machine
Learning},
url={https://arxiv.org/pdf/2103.01338.pdf},
year={2021}
}
@article{lebedev1971order,
title={The order of choice of the iteration parameters in the cyclic Chebyshev iteration method},
author={Lebedev, Vyacheslav Ivanovich and Finogenov, SA},
journal={Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki},
year={1971},
publisher={Russian Academy of Sciences, Branch of Mathematical Sciences},
url={https://doi.org/10.1016/0041-5553(71)90169-8},
openaccess={https://sci-hub.se/10.1016/0041-5553(71)90169-8}
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
document.addEventListener('DOMContentLoaded', doTOC, false);
</script>
<div id="TOC"></div>
<div style="display: none">
$$
\def\aa{\boldsymbol a}
\def\rr{\boldsymbol r}
\def\AA{\boldsymbol A}
\def\HH{\boldsymbol H}
\def\EE{\mathbb E}
\def\II{\boldsymbol I}
\def\CC{\boldsymbol C}
\def\DD{\boldsymbol D}
\def\KK{\boldsymbol K}
\def\eeps{\boldsymbol \varepsilon}
\def\tr{\text{tr}}
\def\LLambda{\boldsymbol \Lambda}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\qq{\boldsymbol q}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\lmax{L}
\def\lmin{\mu}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\QQ{\boldsymbol Q}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\DeclareMathOperator{\span}{\mathbf{span}}
\def\defas{\stackrel{\text{def}}{=}}
\def\dif{\mathop{}\!\mathrm{d}}
\definecolor{colormomentum}{RGB}{27, 158, 119}
\definecolor{colorstepsize}{RGB}{217, 95, 2}
\def\mom{{\color{colormomentum}m}}
\def\step{{\color{colorstepsize}h}}
$$
</div>
<h2>Residual Polynomials and the Chebyshev iterative method</h2>
<p>
As in previous installments of this series of blog posts, our motivating problem is that of
finding a real-valued vector $\xx^\star$ that minimizes the
quadratic function
\begin{equation}\label{eq:opt}
f(\xx) \defas \frac{1}{2}\xx^\top \HH \xx + \bb^\top \xx~,
\end{equation}
where $\HH$ is a $d\times d$ positive definite matrix whose eigenvalues lie in the interval $[\lmin, L]$.
</p>
<p>
I described in the <a href="http://fa.bianp.net/blog/2020/polyopt/">first part</a> of this series a connection between optimization methods and polynomials that allows us to cast the complexity analysis of optimization methods as a problem of bounding polynomials, and which I'll be using extensively here. Through this connection, we can associate to every optimization method, after $t$ iterations, a polynomial $P_t(\lambda) = a_t \lambda^t + \cdots + 1$ of degree $t$ with $P_t(0) = 1$, such that the error at iteration $t$ is given in terms of this polynomial as<dt-note>$P_t(\HH)$ represents the output of evaluating the originally real-valued polynomial $P_t(\cdot)$ at the matrix $\HH$. I'm being a bit liberal with notation here, as $P_t$ represents both the real-valued and matrix-valued polynomial depending on the input.</dt-note>
\begin{equation}
\xx_t - \xx^\star = P_t(\HH)(\xx_0 - \xx^\star)\,.
\end{equation}
</p>
<p>
As we saw <a href="http://fa.bianp.net/blog/2020/polyopt/">in the first part</a> of this series, there's a very illustrious polynomial that minimizes the <i>worst-case convergence rate</i> $\max_{\lambda \in [\lmin, L]}|P_t(\lambda)|$: the scaled and shifted Chebyshev polynomial
\begin{equation}
P^\text{Cheb}_t(\lambda) = \frac{T_t(\sigma(\lambda))}{T_t(\sigma(0))}\,,
\end{equation}
where $T_t$ is the $t$-th degree <a href="https://en.wikipedia.org/wiki/Chebyshev_polynomials">Chebyshev polynomial</a> and $\sigma$ is the link function
$\sigma(\lambda) = \frac{L+\lmin}{L-\lmin} - \frac{2}{L - \lmin}\lambda$.
</p>
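<p>
The optimality properties of this polynomial are easy to check numerically. The sketch below (the values of $\lmin$, $L$ and $t$ are illustrative, not taken from the post) evaluates $P^\text{Cheb}_t$ through NumPy's Chebyshev routines and verifies that $P^\text{Cheb}_t(0) = 1$ and that its maximum absolute value on $[\lmin, L]$ equals $1/T_t(\sigma(0))$.
</p>

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Illustrative values for the eigenvalue interval and the degree (not from the post).
mu, L, t = 0.1, 1.0, 5

def sigma(lam):
    # Link function mapping [mu, L] onto [-1, 1].
    return (L + mu) / (L - mu) - 2.0 / (L - mu) * lam

def T(deg, x):
    # Degree-`deg` Chebyshev polynomial of the first kind, evaluated at x.
    coeffs = np.zeros(deg + 1)
    coeffs[deg] = 1.0
    return C.chebval(x, coeffs)

def P_cheb(lam):
    # Shifted and scaled Chebyshev residual polynomial.
    return T(t, sigma(lam)) / T(t, sigma(0.0))

# Residual polynomial constraint: P(0) = 1.
assert np.isclose(P_cheb(0.0), 1.0)

# On [mu, L], |sigma(lam)| <= 1, so the worst case is 1 / T_t(sigma(0)).
lams = np.linspace(mu, L, 10_001)
worst = np.max(np.abs(P_cheb(lams)))
assert np.isclose(worst, 1.0 / T(t, sigma(0.0)), atol=1e-6)
```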
<p>
Back then we reverse-engineered an optimization method that has as associated residual polynomial the above shifted Chebyshev polynomial.
It turns out that this is not <i>the only way</i> to construct a method that has $P^\text{Cheb}_t$ as its residual polynomial. There are many other ways to do this, and at least one of them doesn't require a momentum term.
</p>
<h2>Young's method</h2>
<p>
Consider the gradient descent method with variable step-size. This method has a recurrence of the form
\begin{equation}
\xx_{t+1} = \xx_t - {\color{colorstepsize}
h_t} \nabla f(\xx_t)\,.
\end{equation}
with one scalar parameter per iteration: the step-size ${\color{colorstepsize}
h_t}$.
</p>
<p>
Subtracting $\xx^\star$ from both sides of the above equation and using the fact that for quadratic objectives $\nabla f(\xx) = \HH \xx$ we have
\begin{align}
\xx_{t+1} - \xx^\star &= (\II - {\color{colorstepsize}h_t} \HH) (\xx_t - \xx^\star)\\
&= (\II - {\color{colorstepsize}h_t}\HH)(\II - {\color{colorstepsize}h_{t-1}}\HH) (\xx_{t-1} - \xx^\star)\\
&= \cdots \\
&= (\II - {\color{colorstepsize}h_t}\HH)(\II - {\color{colorstepsize}h_{t-1}}\HH)\cdots (\II - {\color{colorstepsize}h_0}\HH) (\xx_0 - \xx^\star)\,,
\end{align}
and so the residual polynomial associated with this method is
\begin{align}
P^{\text{GD}}_t(\lambda) = (1 - {\color{colorstepsize} h_t} \lambda)(1 - {\color{colorstepsize} h_{t-1}} \lambda) \cdots (1 - {\color{colorstepsize} h_0}
\lambda)\,.
\end{align}
</p>
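<p>
This identity can be verified numerically. The following sketch (the random quadratic and the step-size sequence are illustrative choices, not from the post) runs a few iterations of variable step-size gradient descent and checks that the error equals the product of the factors $(\II - h_i \HH)$ applied to the initial error.
</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
# Illustrative positive definite quadratic f(x) = 0.5 x^T H x + b^T x.
A = rng.standard_normal((d, d))
H = A @ A.T + 0.1 * np.eye(d)
b = rng.standard_normal(d)
x_star = np.linalg.solve(H, -b)  # minimizer: H x* + b = 0

# Gradient descent with an arbitrary sequence of step-sizes.
steps = [0.01, 0.02, 0.005, 0.015]
x0 = rng.standard_normal(d)
x = x0.copy()
for h in steps:
    x = x - h * (H @ x + b)  # gradient of f at x is H x + b

# The error is the product of the factors (I - h_i H) applied to the initial error.
P = np.eye(d)
for h in steps:
    P = (np.eye(d) - h * H) @ P
assert np.allclose(x - x_star, P @ (x0 - x_star))
```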
<p>
Since the residual polynomial determines the convergence of a method, a simple way to ensure that variable step-size gradient descent converges like the Chebyshev iterative method is to find the parameters, in this case the step-sizes, that make $P^{\text{GD}}_t$ equal to the Chebyshev residual polynomial.
</p>
<p>
David Young<dt-note>
<img style="display: block; margin: 0 auto; max-width: 150px; box-shadow: 6px 6px 3px grey;" src="/images/2020/david_young.jpg" alt=""> <br><a href="https://en.wikipedia.org/wiki/David_M._Young_Jr.">David Young</a> (1923-2008) was an American mathematician and computer scientist. He is known for some of the most successful linear algebra methods such as <a href="https://en.wikipedia.org/wiki/Successive_over-relaxation">successive over-relaxation</a> and <a href="https://en.wikipedia.org/wiki/Symmetric_successive_over-relaxation">symmetric successive over-relaxation</a>.</dt-note> devised in 1953 an elegant approach to set these parameters.<dt-cite key="young1953richardson"></dt-cite> <dt-note>The main motivation of this algorithm at the time was to save memory by avoiding the momentum term. Although computer capabilities have expanded greatly since then, memory is often still the bottleneck <a href="https://papers.nips.cc/paper/9168-memory-efficient-adaptive-optimization.pdf">(Anil et al. 2019)</a>. </dt-note> His method uses the fact that the $P^{\text{GD}}_t$ polynomial above is already in factored form and its roots are
\begin{equation}\label{eq:roots_gd}
\frac{1}{{\color{colorstepsize} h_0}}, \frac{1}{{\color{colorstepsize} h_1}}, \ldots, \frac{1}{{\color{colorstepsize}
h_t}}\,.
\end{equation}
Young's key insight is that to construct the same residual polynomial it's sufficient that the <i>roots</i> match, as the scale is anyway fixed by the residual polynomial constraint $P_t(0) = 1$. This approach is practical because the roots of Chebyshev polynomials are easy to derive using known identities between these polynomials and trigonometric functions.
</p>
<p class="theorem" text="Roots of Chebyshev residual polynomial">
The roots of the $t$-th degree Chebyshev residual polynomial $P^\text{Cheb}_t$ are given by
\begin{equation}\label{eq:chebyshev_roots}
\lambda_i = {\textstyle \frac{1}{2}(L + \lmin) + \frac{1}{2}(L -
\lmin)\cos\left(\frac{\pi(i+1/2)}{t}\right)}\,,
\end{equation}
for $i=0, \ldots, t-1$.
</p>
<div class="proof">
<p>
The Chebyshev polynomial $T_t$ can be characterized as the polynomial of degree $t$ that
verifies
\begin{equation}
T_t(\cos(\theta)) = \cos(t \theta)\,.
\end{equation}
Since $\cos((i + 1/2) \pi) = 0$ for any integer $i$, the right-hand side is zero for
$\theta = (i+1/2)\pi/t$, and so the
roots of the Chebyshev polynomial<dt-note>These roots are sometimes called <a href="https://en.wikipedia.org/wiki/Chebyshev_nodes">Chebyshev nodes</a> because they are
used as nodes in polynomial interpolation. </dt-note> $T_t$ are
given by
\begin{equation}
\xi_i = {\textstyle \cos\left(\frac{\pi(i+1/2)}{t}\right),}~ i=0,\ldots,t-1\,.
\end{equation}
</p>
<p>
From the roots of the Chebyshev polynomials, we can find the roots of the
<i>shifted</i> Chebyshev polynomial $P^\text{Cheb}_t$ by solving $\sigma(\lambda_i) = \xi_i$ for
$\lambda_i$, which gives the desired value $\lambda_i$ above.
</p>
</div>
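<p>
The theorem is also easy to verify numerically: the sketch below (with illustrative values of $\lmin$, $L$ and $t$) computes the roots \eqref{eq:chebyshev_roots} and checks that $T_t(\sigma(\lambda_i)) = 0$.
</p>

```python
import numpy as np
from numpy.polynomial import chebyshev as C

mu, L, t = 0.1, 1.0, 7  # illustrative values

# Roots of the shifted Chebyshev residual polynomial from the theorem.
i = np.arange(t)
lam = 0.5 * (L + mu) + 0.5 * (L - mu) * np.cos(np.pi * (i + 0.5) / t)

# They must be zeros of T_t(sigma(.)), with sigma the link function.
sigma = (L + mu) / (L - mu) - 2.0 / (L - mu) * lam
coeffs = np.zeros(t + 1)
coeffs[t] = 1.0
assert np.allclose(C.chebval(sigma, coeffs), 0.0, atol=1e-8)
```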
<p>
These roots also have a nice trigonometric interpretation, as they correspond to the projection onto the $x$-axis of $t$ equally spaced points on the semicircle between $\lmin$ and $L$.
</p>
<figure>
<span class="marginnote"> The roots of the Chebyshev residual polynomial \eqref{eq:chebyshev_roots} can be seen as the $x$ coordinates of $t$ equally spaced
points on the semicircle between $\lmin$ and $L$ (in this case $t=15$). <br> <a href="https://colab.research.google.com/gist/fabianp/736b1d0a829dd93764d67b412da10f97/no_momentum.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></span>
<img src="/images/2020/ChebyshevRoots.svg" alt="">
</figure>
<p>
Coming back to Young's method, if we set the step-size parameters as
\begin{equation}
{\color{colorstepsize} h_i} = \frac{1}{\lambda_i}\,, i = 0, \ldots, t-1\,,
\end{equation}
where the $\lambda_i$ are the Chebyshev roots from the previous theorem, then the residual polynomial of this method at iteration $t$ is identical to the Chebyshev residual polynomial. Furthermore, this is true <i>irrespective</i> of the order in which the step-sizes are taken. For now, let's take the steps in increasing order and see what happens. We'll reconsider this reckless decision later on.
</p>
<p>
Putting it all together, we can finally write Young's method:
</p>
<p class="framed">
<b class="tufte-underline">Young's Method</b><br>
<b>Input</b>: starting guess $\xx_0$ and $t_{\max}$.<br>
<b>For</b> $t=0, \ldots, t_{\max}-1$ compute
\begin{align*}
&\quad{\color{colorstepsize}
h_t} = 2\big({(L + \lmin) + (L -
\lmin)\cos({(t + \tfrac{1}{2})}\tfrac{\pi}{t_{\max}})}\big)^{-1}\\
&\quad\xx_{t+1} = \xx_t - {\color{colorstepsize}
h_t}\nabla f(\xx_t)
\end{align*}
</p>
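<p>
Here is a minimal sketch of Young's method (a hypothetical NumPy implementation, not the notebook linked below; problem sizes are illustrative). It takes the step-sizes $1/\lambda_i$ in increasing order and verifies that the final error equals $P^\text{Cheb}_{t_{\max}}(\HH)$ applied to the initial error, as the theory predicts.
</p>

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(0)
d, t_max = 6, 10
mu, L = 0.1, 1.0  # illustrative eigenvalue bounds

# Illustrative quadratic with known eigenvalues in [mu, L].
eigs = np.linspace(mu, L, d)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(d)
x_star = np.linalg.solve(H, -b)

# Young's method: step-sizes are the inverses of the Chebyshev roots,
# taken in increasing order.
i = np.arange(t_max)
roots = 0.5 * (L + mu) + 0.5 * (L - mu) * np.cos(np.pi * (i + 0.5) / t_max)
steps = np.sort(1.0 / roots)

x0 = rng.standard_normal(d)
x = x0.copy()
for h in steps:
    x = x - h * (H @ x + b)

# The final error matches the Chebyshev residual polynomial applied to x0 - x*.
coeffs = np.zeros(t_max + 1)
coeffs[t_max] = 1.0
sig = (L + mu) / (L - mu) - 2.0 / (L - mu) * eigs
p_vals = C.chebval(sig, coeffs) / C.chebval((L + mu) / (L - mu), coeffs)
P_cheb_H = Q @ np.diag(p_vals) @ Q.T
assert np.allclose(x - x_star, P_cheb_H @ (x0 - x_star))
```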
<p>
An empirical comparison of this method against gradient descent and the Chebyshev iterative method on synthetic data confirms that it achieves the same last-iterate convergence rate.
</p>
<figure>
<span class="marginnote">
<br>
Young's method has the same last-iterate convergence rate as the Chebyshev iterative method on quadratic problems. The suboptimality of earlier iterates is substantially different and depends on the step-size ordering.
<br> <br> <a href="https://colab.research.google.com/gist/fabianp/736b1d0a829dd93764d67b412da10f97/no_momentum.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a>
</span>
<img src="/images/2020/convergence_young_0.svg" alt="">
</figure>
<p>
As is clear from the plot above, in Young's method <i>only the last iterate</i> has a rate of convergence that matches that of the Chebyshev method. This is because its residual polynomial coincides with the shifted Chebyshev polynomial $P^\text{Cheb}_t$ <i>only</i> at the last iterate. In contrast, for the Chebyshev method, the residual polynomial at <i>each iterate</i> is the residual Chebyshev polynomial $P^\text{Cheb}_t$, making it optimal at every iteration, and not only at the last iterate.<dt-note>Another way to say this is that Chebyshev is an <i>anytime</i> algorithm, while Young's method isn't.</dt-note>
</p>
<p>
A practical drawback is that the number of iterations needs to be known in advance, whereas in practice it's often more convenient to set a tolerance rather than a maximum number of iterations.
</p>
<h2>Beware of Numerical Errors!</h2>
<p>
As Young already noted, this method is unfortunately very numerically unstable. As the number of iterations $t_{\max}$ grows, the largest step-size approaches $1/\lmin$, which becomes problematic when $\lmin$ is small, since small errors in the computation of $\lmin$ are then greatly amplified.
</p>
<p>
For example, in the same setting as the previous figure, increasing the number of iterations from 30 to 50 is enough for the error to diverge in the last few iterates.
</p>
<figure>
<span class="marginnote">Young's method suffers from numerical issues as the number of iterations $t_{\max}$ grows to infinity and $\lmin$ is close to zero.
<br><br> <a href="https://colab.research.google.com/gist/fabianp/736b1d0a829dd93764d67b412da10f97/no_momentum.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></span>
<img src="/images/2020/convergence_young_1.svg" alt="">
</figure>
<p>
A simple workaround that prevents the step-size from getting too close to $1/\lmin$ is to limit the maximum number of iterations $t_{\max}$ to something small, and then start over once all $t_{\max}$ step-sizes have been used. I call this method "Young's method with restarts".
</p>
<p class="framed">
<b class="tufte-underline">Young's Method with Restarts</b><br>
<b>Input</b>: starting guess $\xx_0$ and restarting length $t_{\text{cycle}}$.<br>
<b>For</b> $t=0, 1, \ldots$ compute
\begin{align*}
&\quad k = t \bmod t_{\text{cycle}} \\
&\quad{\color{colorstepsize}
h_t} = 2\big({(L + \lmin) + (L -
\lmin)\cos({(k + \tfrac{1}{2})}\tfrac{\pi}{t_{\text{cycle}}})}\big)^{-1}\\
&\quad\xx_{t+1} = \xx_t - {\color{colorstepsize}
h_t}\nabla f(\xx_t)
\end{align*}
</p>
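<p>
A minimal sketch of the restarted variant (again a hypothetical implementation with illustrative problem sizes): the step-sizes simply cycle through the $t_{\text{cycle}}$ inverse Chebyshev roots.
</p>

```python
import numpy as np

def young_restarts(H, b, x0, mu, L, t_cycle, n_iter):
    """Gradient descent cycling through the t_cycle inverse Chebyshev roots (sketch)."""
    k = np.arange(t_cycle)
    roots = 0.5 * (L + mu) + 0.5 * (L - mu) * np.cos(np.pi * (k + 0.5) / t_cycle)
    steps = 1.0 / roots  # already in increasing order
    x = x0.copy()
    for t in range(n_iter):
        x = x - steps[t % t_cycle] * (H @ x + b)
    return x

# Usage on a small illustrative quadratic.
rng = np.random.default_rng(0)
d, mu, L = 5, 0.1, 1.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(np.linspace(mu, L, d)) @ Q.T
b = rng.standard_normal(d)
x_star = np.linalg.solve(H, -b)

x = young_restarts(H, b, np.zeros(d), mu, L, t_cycle=15, n_iter=60)
assert np.linalg.norm(x - x_star) < 1e-3 * np.linalg.norm(x_star)
```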
<p>
Although this method is no longer optimal,<dt-note>Or, to be more precise, it is only optimal when stopped after $t_{\text{cycle}}$ iterations, in which case it coincides with the standard Young method.</dt-note> it avoids some of the numerical issues of the standard method. See below a comparison on the same problem as above when I take $t_{\text{cycle}}$ to be 15:
</p>
<figure>
<span class="marginnote">The convergence speed of the Cyclical Young method is similar to that of the Chebyshev method, and it avoids some of the numerical issues associated with the non-restarted variant.
<br> <a href="https://colab.research.google.com/gist/fabianp/736b1d0a829dd93764d67b412da10f97/no_momentum.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></span>
<img src="/images/2020/convergence_young_1_bis.svg" alt="">
</figure>
<h2>A connection with cosine annealing and other recent work</h2>
<p>
There are all sorts of connections still to be made between variable step-size methods and acceleration.
</p>
<p>
An obvious question is whether any of this extends to the stochastic setting. Although I don't have an answer for this, there are some similarities between this method and the recently proposed<dt-cite key="loshchilov2016sgdr"></dt-cite> and wildly successful<dt-note>As of August 2020, the paper counts more than 1000 citations.</dt-note> <dt-note>See also my recent blog post on <a href="/blog/2022/cyclical/">cyclical step-sizes</a>.</dt-note> cosine annealing step-size
\begin{align*}
h_i &= h_{\min} + \tfrac{1}{2}(h_{\max} - h_{\min})(1 +
\cos(\frac{i}{t_{\max}}\pi))\\
&= \tfrac{1}{2}(h_{\min} + h_{\max}) + \tfrac{1}{2}(h_{\max} - h_{\min})
\cos(\frac{i}{t_{\max}}\pi)
\end{align*}
which has the same form as the Chebyshev roots \eqref{eq:chebyshev_roots}, that is, the <i>inverse</i> of Young's step-sizes, modulo the very minor modification $i+1/2 \to i$.
</p>
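<p>
This correspondence is immediate to check numerically. In the sketch below, $h_{\min}$ is identified with $\lmin$ and $h_{\max}$ with $L$ (an identification made only for illustration): with the offset $i \to i + 1/2$ applied, the cosine annealing schedule and the Chebyshev roots \eqref{eq:chebyshev_roots} coincide exactly.
</p>

```python
import numpy as np

mu, L, t_max = 0.1, 1.0, 20  # illustrative values
i = np.arange(t_max)

# Cosine annealing step-size, identifying h_min with mu and h_max with L
# (this identification is only for illustration).
h_min, h_max = mu, L
cosine = 0.5 * (h_min + h_max) + 0.5 * (h_max - h_min) * np.cos(np.pi * i / t_max)

# Chebyshev roots use the half-integer offset i + 1/2 instead of i.
cheb_roots = 0.5 * (L + mu) + 0.5 * (L - mu) * np.cos(np.pi * (i + 0.5) / t_max)

# With the offset applied, the two formulas coincide exactly.
shifted = 0.5 * (h_min + h_max) + 0.5 * (h_max - h_min) * np.cos(np.pi * (i + 0.5) / t_max)
assert np.allclose(shifted, cheb_roots)
# Without it, they differ only slightly.
assert np.max(np.abs(cosine - cheb_roots)) < 0.1
```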
<p>
After writing this blog post, I became aware of this recent paper by <a href="https://naman33k.github.io/">Naman Agarwal</a>, <a href="https://www.cs.utexas.edu/~surbhi/">Surbhi Goel</a> and <a href="https://cyrilzhang.com/">Cyril Zhang</a><dt-cite key="agarwal2021acceleration"></dt-cite> that proposes a <i>fractal</i> ordering of the step-sizes. They show that this ordering is more robust, and also provide robustness analysis, experimental validation and links with the Soviet literature that I was unaware of.<dt-cite key="lebedev1971order"></dt-cite> Definitely worth checking out!
</p>
<h2>Citing</h2>
<p>
If you find this blog post useful, please consider citing it as:
</p>
<p style="margin-left: 5%">
<a href="http://fa.bianp.net/blog/2021/no-momentum/">Acceleration without Momentum</a>, Fabian Pedregosa, 2021
</p>
<p>
Bibtex entry:
</p>
<pre>
<code>
@misc{pedregosa2021nomomentum,
title={Acceleration without Momentum},
author={Pedregosa, Fabian},
howpublished = {\url{http://fa.bianp.net/blog/2021/no-momentum/}},
year={2021}
}
</code>
</pre>
<h2>Thanks</h2>
<p>
This blog post grew out of numerous discussions with collaborators <a href="http://nicolas.le-roux.name/">Nicolas Le Roux</a> and <a href="https://damienscieur.com/">Damien Scieur</a>. A note of gratitude is due also to <a href="https://konstmish.github.io/">Konstantin Mischenko</a>, for constructive comments and for suggesting the name "Young method with restarts".
</p>
<hr />
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
On the Link Between Optimization and Polynomials, Part 32021-03-02T00:00:00+01:002021-03-02T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2021-03-02:/blog/2021/hitchhiker/
<blockquote class="pullquote">
<p>
<q><i>I've seen things you people wouldn't believe. <br>
Valleys sculpted by trigonometric functions. <br>
Rates on fire off the shoulder of divergence. <br>
Beams glitter in the dark near the Polyak gate. <br>
All those landscapes will be lost in time, like tears in rain. <br>Time to halt.</i></q> <br>
</p>
<p style="text-align: right;">
A momentum optimizer <a href="https://en.wikipedia.org/wiki/Tears_in_rain_monologue">*</a>
</p>
</blockquote>
<figure class="fullwidth">
<img style="max-width: 1000px;" src="/images/2021/rate_convergence_momentum.png" alt="">
</figure>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js", "color.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@article{Rutishauser1959,
author="Rutishauser, H.",
title="Theory of Gradient Methods",
journal="Refined Iterative Methods for Computation of the Solution and the Eigenvalues of Self-Adjoint Boundary Value Problems",
year="1959",
url="https://doi.org/10.1007/978-3-0348-7224-9_2"
}
@article{flanders1950numerical,
title={Numerical determination of fundamental modes},
author={Flanders, Donald A and Shortley, George},
journal={Journal of Applied Physics},
year={1950},
url={https://sci-hub.tw/10.1063/1.1699598},
}
@book{fischer1996polynomial,
title={Polynomial based iteration methods for symmetric linear systems},
author={Fischer, Bernd},
year={1996},
url={https://doi.org/10.1007/978-3-663-11108-5},
journal={Springer}
}
@article{hestenes1952methods,
title={Methods of conjugate gradients for solving linear systems},
author={Hestenes, Magnus and Stiefel, Eduard},
journal={Journal of research of the National Bureau of Standards},
year={1952},
url={https://pdfs.semanticscholar.org/466d/addfb6340c28cb8da548007028c8cc5df687.pdf}
}
@article{pedregosa2020average,
title={Average-case Acceleration Through Spectral Density Estimation},
author={Pedregosa, Fabian and Scieur, Damien},
journal={arXiv preprint arXiv:2002.04756},
year={2020},
url={https://arxiv.org/pdf/2002.04756.pdf}
}
@article{scieur2020universal,
title={Universal Average-Case Optimality of Polyak Momentum},
author={Scieur, Damien and Pedregosa, Fabian},
journal={arXiv preprint arXiv:2002.04664},
year={2020},
url={https://arxiv.org/pdf/2002.04664.pdf}
}
@article{polyak1964some,
title={Some methods of speeding up the convergence of iteration methods},
author={Polyak, Boris T},
journal={USSR Computational Mathematics and Mathematical Physics},
year={1964},
url={https://doi.org/10.1016/0041-5553(64)90137-5}
}
@article{polyak1987introduction,
title={Introduction to Optimization},
author={Polyak, Boris T},
journal={Optimization Software, Inc. Publications Division, New York},
url={https://b-ok.cc/book/2461679/c8b7e4},
year={1987}
}
@article{frankel1950convergence,
title={Convergence rates of iterative treatments of partial differential equations},
author={Frankel, Stanley},
journal={Mathematical Tables and Other Aids to Computation},
year={1950},
url={https://www.jstor.org/stable/2002770},
publisher={JSTOR}
}
@inproceedings{sutskever2013importance,
title={On the importance of initialization and momentum in deep learning},
author={Sutskever, Ilya and Martens, James and Dahl, George and Hinton, Geoffrey},
journal={International conference on machine learning},
year={2013},
url={http://proceedings.mlr.press/v28/sutskever13.pdf}
}
@article{zhang20202dive,
title={Dive into Deep Learning},
author={Zhang, Aston and Lipton, Zachary C and Li, Mu and Smola, Alexander J},
journal={https://d2l.ai},
year={2020},
url={https://d2l.ai/chapter_optimization/momentum.html}
}
@book{elaydi2005introduction,
title={An introduction to difference equations},
author={Elaydi, Saber},
year={2005},
journal={Springer}
}
@article{perron1921summengleichungen,
title={{\"U}ber summengleichungen und Poincar{\'e}sche differenzengleichungen},
author={Perron, Oskar},
journal={Mathematische Annalen},
year={1921},
publisher={Springer}
}
@article{hochstrasser1954anwendung,
title={Die Anwendung der Methode der konjugierten Gradienten und ihrer Modifikationen auf die Lösung linearer Randwertprobleme},
author={Hochstrasser, Urs},
year={1954},
journal={Doctoral Thesis (in German), ETH Zurich},
url={https://doi.org/10.3929/ethz-a-000091966}
}
@inproceedings{ghadimi2015global,
title={Global convergence of the heavy-ball method for convex optimization},
author={Ghadimi, Euhanna and Feyzmahdavian, Hamid Reza and Johansson, Mikael},
booktitle={2015 European Control Conference (ECC)},
year={2015},
url={https://arxiv.org/abs/1412.7457},
}
@inproceedings{flammarion2015averaging,
title={From averaging to acceleration, there is only a step-size},
author={Flammarion, Nicolas and Bach, Francis},
booktitle={Conference on Learning Theory},
year={2015}
}
@book{khrushchev2008orthogonal,
title={Orthogonal polynomials and continued fractions: from Euler's point of view},
author={Khrushchev, Sergey},
year={2008},
url={https://www.maths.ed.ac.uk/~v1ranick/papers/khrushchev.pdf},
publisher={Cambridge University Press}
}
@article{totik2005orthogonal,
title={Orthogonal polynomials},
author={Totik, Vilmos},
journal={arXiv preprint math/0512424},
year={2005},
url={https://arxiv.org/pdf/math/0512424.pdf}
}
@book{suli2003introduction,
title={An introduction to numerical analysis},
author={Süli, Endre and Mayers, David},
year={2003},
publisher={Cambridge University Press},
url={https://www.cambridge.org/core/books/an-introduction-to-numerical-analysis/FD8BCAD7FE68002E2179DFF68B8B7237}
}
@book{deift1999orthogonal,
title={Orthogonal polynomials and random matrices: a Riemann-Hilbert approach},
author={Deift, Percy},
volume={3},
year={1999},
publisher={American Mathematical Soc.},
url={https://books.google.ca/books?id=SBR8yv0LkFgC&lpg=PP1&pg=PP1#v=onepage&q&f=false}
}
@article{legendre1782,
title={Recherches sur l'attraction des sphéroïdes homogènes},
author={Legendre, Adrien-Marie},
year={1782},
url={https://web.archive.org/web/20090920070434/http://edocs.ub.uni-frankfurt.de/volltexte/2007/3757/pdf/A009566090.pdf},
journal={Mémoires de Mathématiques et de Physique, présentés à l'Académie Royale des Sciences, par divers savans, et lus dans ses Assemblées}
}
@book{szeg1975orthogonal,
title={Orthogonal polynomials},
author={Szegő, Gábor},
year={1975},
journal={American Mathematical Soc.},
url={https://people.math.osu.edu/nevai.1/SZEGO/szego=szego1975=ops=OCR.pdf}
}
@book{gautschi2004orthogonal,
title={Orthogonal polynomials},
author={Gautschi, Walter},
year={2004},
journal={Oxford University Press},
url={https://global.oup.com/academic/product/orthogonal-polynomials-9780198506720?cc=ca&lang=en&}
}
@book{chihara2011introduction,
title={An introduction to orthogonal polynomials},
author={Chihara, Theodore S},
year={2011},
publisher={Courier Corporation},
url={https://books.google.ca/books?id=71CVAwAAQBAJ&lpg=PP1&pg=PP1#v=onepage&q&f=false}
}
@article{marcellan2001favard,
title={On the “Favard theorem” and its extensions},
author={Marcellán, Francisco and Álvarez-Nodarse, Renato},
journal={Journal of computational and applied mathematics},
year={2001},
url={http://merlin.us.es/~renato/papers/fav-jcam.pdf},
publisher={Elsevier}
}
@article{favard1935,
author = {Jean Favard},
title = {Sur les polynomes de Tchebicheff},
journal = {Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences, Paris},
url={https://gallica.bnf.fr/ark:/12148/bpt6k3152t/f2052.item},
volume = {200},
year = {1935},
publisher = {Gauthier-Villars, Paris},
}
@article{cretney2014origins,
title={The origins of Euler's early work on continued fractions},
author={Cretney, Rosanna},
journal={Historia Mathematica},
url={https://doi.org/10.1016/j.hm.2013.12.004},
pages={139--156},
year={2014},
publisher={Elsevier}
}
@book{nevai1979orthogonal,
title={Orthogonal polynomials},
author={Nevai, Paul G},
volume={213},
year={1979},
journal={American Mathematical Soc.},
url={https://books.google.ca/books?id=hxTUCQAAQBAJ&lpg=PP1&pg=PP1#v=onepage&q&f=false}
}
@book{mhaskar1997introduction,
title={Introduction to the theory of weighted polynomial approximation},
author={Mhaskar, Hrushikesh Narhar},
volume={7},
year={1997},
publisher={World Scientific}
}
@article{aptekarev1994spatial,
title={Spatial entropy of central potentials and strong asymptotics of orthogonal polynomials},
author={Aptekarev, AI and Dehesa, JS and Yáñez, RJ},
journal={Journal of Mathematical Physics},
volume={35},
number={9},
pages={4423--4428},
year={1994},
publisher={American Institute of Physics},
url={https://aip.scitation.org/doi/10.1063/1.530861}
}
@book{grenander1958toeplitz,
title={Toeplitz forms and their applications},
author={Grenander, Ulf and Szegő, Gábor},
year={1958},
publisher={Univ of California Press}
}
@article{zhang2017yellowfin,
title={Yellowfin and the art of momentum tuning},
author={Zhang, Jian and Mitliagkas, Ioannis},
journal={SysML},
year={2018},
url={https://arxiv.org/pdf/1706.03471.pdf}
}
@inproceedings{stahl1990nth,
title={Nth Root Asymptotic Behavior of Orthonormal Polynomials},
author={Stahl, Herbert and Totik, Vilmos},
booktitle={Orthogonal polynomials},
pages={395--417},
year={1990},
publisher={Springer},
url={https://doi.org/10.1007/978-94-009-0501-6_18}
}
@article{goh2017momentum,
title={Why momentum really works},
author={Goh, Gabriel},
journal={Distill},
volume={2},
number={4},
pages={e6},
year={2017},
url={https://distill.pub/2017/momentum/}
}
@article{flanders1950numerical,
title={Numerical determination of fundamental modes},
author={Flanders, Donald A and Shortley, George},
journal={Journal of Applied Physics},
year={1950},
url={https://doi.org/10.1063/1.1699598},
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
document.addEventListener('DOMContentLoaded', doTOC, false);
</script>
<div id="TOC"></div>
<div style="display: none">
$$
\def\aa{\boldsymbol a}
\def\rr{\boldsymbol r}
\def\AA{\boldsymbol A}
\def\HH{\boldsymbol H}
\def\EE{\mathbb E}
\def\II{\boldsymbol I}
\def\CC{\boldsymbol C}
\def\DD{\boldsymbol D}
\def\KK{\boldsymbol K}
\def\eeps{\boldsymbol \varepsilon}
\def\tr{\text{tr}}
\def\LLambda{\boldsymbol \Lambda}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\qq{\boldsymbol q}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\lmax{L}
\def\lmin{\mu}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\QQ{\boldsymbol Q}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\DeclareMathOperator{\span}{\mathbf{span}}
\DeclareMathOperator{\Ima}{Im}
\def\defas{\stackrel{\text{def}}{=}}
\def\dif{\mathop{}\!\mathrm{d}}
\definecolor{colormomentum}{RGB}{27, 158, 119}
\definecolor{colorstepsize}{RGB}{217, 95, 2}
\def\mom{{\color{colormomentum}m}}
\def\step{{\color{colorstepsize}h}}
$$
</div>
<h2>Gradient Descent with Momentum</h2>
<p>
Gradient descent with momentum,<dt-cite key="polyak1964some"></dt-cite> also known as heavy ball or momentum for short, is an optimization method
designed to solve unconstrained minimization problems of the form
\begin{equation}
\argmin_{\xx \in \RR^d} f(\xx)\,,
\end{equation}
where the objective function $f$ is differentiable and we have access to its gradient $\nabla
f$. In this method
the update is a sum of two terms. The first term is the
difference between the current and the previous iterate $(\xx_{t} - \xx_{t-1})$, also known as <i>momentum
term</i>. The second term is the gradient $\nabla f(\xx_t)$ of the objective function.
</p>
<p class="framed">
<b class="tufte-underline">Gradient Descent with Momentum</b><br>
<b>Input</b>: starting guess $\xx_0$, step-size $\step > 0$ and momentum
parameter $\mom \in (0, 1)$.<br>
$\xx_1 = \xx_0 - \dfrac{\step}{\mom+1} \nabla f(\xx_0)$ <br>
<b>For</b> $t=1, 2, \ldots$ compute
\begin{equation}\label{eq:momentum_update}
\xx_{t+1} = \xx_t + \mom(\xx_{t} - \xx_{t-1}) - \step\nabla
f(\xx_t)
\end{equation}
</p>
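<p>
As a concrete illustration, here is a minimal NumPy sketch of this update rule applied to a small quadratic (the function and variable names are ours, not from any library):
</p>

```python
import numpy as np

def momentum(grad_f, x0, h, m, n_iter):
    """Gradient descent with momentum: x_{t+1} = x_t + m (x_t - x_{t-1}) - h grad_f(x_t),
    with the first step x_1 = x_0 - h / (m + 1) grad_f(x_0) from the algorithm box."""
    x_prev = x0
    x = x0 - h / (m + 1) * grad_f(x0)
    for _ in range(n_iter - 1):
        x, x_prev = x + m * (x - x_prev) - h * grad_f(x), x
    return x

# quadratic f(x) = 0.5 x^T H x + b^T x with diagonal H (eigenvalues 1, 4, 9)
H = np.array([1.0, 4.0, 9.0])
b = np.array([1.0, -2.0, 3.0])
grad_f = lambda x: H * x + b
x_star = -b / H  # the minimizer solves H x + b = 0
x_t = momentum(grad_f, np.zeros(3), h=0.25, m=0.25, n_iter=200)
```

With these values of $\step$ and $\mom$ (which, as we'll see at the end of the post, are Polyak's choice for this spectrum), the iterates converge quickly to the minimizer.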
<p>
Despite its simplicity, gradient descent with momentum exhibits unexpectedly rich dynamics that we'll explore in this post. <dt-note>An excellent paper that explores the dynamics of momentum is Gabriel Goh's <a href="https://distill.pub/2017/momentum/">Why Momentum Really Works</a>. <br><br>The landscapes described in the section <a href="https://distill.pub/2017/momentum/#momentum2D">"The Dynamics of Momentum"</a> are different from the ones described in this post. The ones in Gabriel's paper describe the improvement along the direction of a single eigenvector, and as such give at best a partial view of convergence (in fact, they give the misleading impression that the best convergence is achieved with zero momentum). The techniques used in this post instead allow us to compute the global convergence rate, and the figures in this post correctly predict a non-trivial optimal momentum term.</dt-note> <dt-cite key="goh2017momentum"></dt-cite>
</p>
<h3>Convergence rates</h3>
<p>
<a href="http://fa.bianp.net/blog/2020/momentum/">In the last post</a> we saw that, assuming $f$ is a quadratic objective of the form
\begin{equation}\label{eq:opt}
f(\xx) \defas \frac{1}{2}\xx^\top \HH \xx + \bb^\top \xx~,
\end{equation}
then we can analyze gradient descent with momentum using a convex combination of Chebyshev polynomials of the first and second kind.
</p>
<p>
More precisely, we showed that the error at iteration $t$ can be bounded as
\begin{equation}\label{eq:theorem}
\|\xx_t - \xx^\star\| \leq \underbrace{\max_{\lambda \in [\lmin, L]} |P_t(\lambda)|}_{\defas r_t} \|\xx_0 - \xx^\star\|\,,
\end{equation}
where $\lmin$ and $L$ are the smallest and largest eigenvalues of $\HH$ respectively and $P_t$ is a $t$-th degree polynomial that we can write as
\begin{align}
P_t(\lambda) = \mom^{t/2} \left( {\small\frac{2\mom}{1+\mom}}\, T_t(\sigma(\lambda)) + {\small\frac{1 - \mom}{1 + \mom}}\,U_t(\sigma(\lambda))\right)\,,
\end{align}
where $\sigma(\lambda) = {\small\dfrac{1}{2\sqrt{\mom}}}(1 +
\mom -
\step\,\lambda)\,$ is a linear function that we'll refer to as the <i>link function</i> and $T_t$ and $U_t$ are the Chebyshev polynomials of the first and second kind respectively.
</p>
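<p>
This identity is easy to verify numerically: on a diagonal quadratic, the error of the momentum iterates after $t$ steps equals $P_t(\lambda)$ times the initial error along each eigendirection. A sketch under the setup above, evaluating $P_t$ through the three-term recurrence that both kinds of Chebyshev polynomials share (the helper names are ours):
</p>

```python
import numpy as np

def cheb_TU(t, s):
    """T_t(s) and U_t(s) via the shared recurrence X_{k+1} = 2 s X_k - X_{k-1}."""
    T = [np.ones_like(s), s]          # T_0, T_1
    U = [np.ones_like(s), 2 * s]      # U_0, U_1
    for _ in range(t - 1):
        T.append(2 * s * T[-1] - T[-2])
        U.append(2 * s * U[-1] - U[-2])
    return T[t], U[t]

def residual_polynomial(t, lam, h, m):
    """P_t(lam), the convex combination of Chebyshev polynomials above."""
    sigma = (1 + m - h * lam) / (2 * np.sqrt(m))   # link function
    T, U = cheb_TU(t, sigma)
    return m ** (t / 2) * (2 * m * T + (1 - m) * U) / (1 + m)

# momentum on a diagonal quadratic: error_t should equal P_t(lambda) * error_0
H = np.array([1.0, 4.0, 9.0])
b = np.array([1.0, -2.0, 3.0])
h, m = 0.2, 0.3
x_star, e0 = -b / H, b / H            # x_0 = 0, so e_0 = x_0 - x_star = b / H
x_prev = np.zeros(3)
x = x_prev - h / (m + 1) * (H * x_prev + b)
max_dev = 0.0
for t in range(1, 12):
    pred = residual_polynomial(t, H, h, m) * e0
    max_dev = max(max_dev, np.max(np.abs((x - x_star) - pred)))
    x, x_prev = x + m * (x - x_prev) - h * (H * x + b), x
```

Up to floating-point error, the deviation `max_dev` between the actual errors and the polynomial prediction is zero.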
<p>
We'll denote by $r_t$ the maximum absolute value of this polynomial over the $[\lmin, L]$ interval, and call the <i>convergence rate</i> of an algorithm the $t$-th root of this quantity, $\sqrt[t]{r_t}$. We'll also often use the limit superior of this quantity, $\limsup_{t \to \infty} \sqrt[t]{r_t}$, which we'll call the <i>asymptotic rate</i>. The asymptotic rate provides a convenient way to compare the (asymptotic) speed of convergence, as it allows us to ignore slower terms that are negligible for large $t$.
</p>
<p>
This identity between gradient descent with momentum and Chebyshev polynomials opens the door to deriving the asymptotic properties of the optimizer from those of Chebyshev polynomials.
</p>
<h2>The Two Faces of Chebyshev Polynomials</h2>
<p>
Chebyshev polynomials behave very differently inside and outside the interval $[-1, 1]$. Inside this interval (shaded blue region) the magnitude of these polynomials stays close to zero, while outside it explodes:
</p>
<figure>
<span class="marginnote">The behavior of Chebyshev polynomials of the first and second kind is drastically different inside and outside the $[-1, 1]$ interval.<br><br>
<a href="https://colab.research.google.com/gist/fabianp/0085405720c2bbd3d2c4f72a96c3d054/momentum_chebyshev.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></span>
<img src="/images/2020/chebyshev_interval.png">
</figure>
<p>Let's make this observation more precise.</p>
<p>
<b>Inside</b> the $[-1, 1]$ interval, Chebyshev polynomials admit the <a href="https://en.wikipedia.org/wiki/Chebyshev_polynomials#Trigonometric_definition">trigonometric definitions</a> $T_t(\cos(\theta)) = \cos(t \theta)$ and $U_{t}(\cos(\theta)) = \sin((t+1)\theta) / \sin(\theta)$ and so they have an oscillatory behavior with values bounded in absolute value by 1 and $t+1$ respectively.
</p>
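<p>
Both trigonometric identities can be checked numerically against the three-term recurrence $X_{t+1}(\xi) = 2\xi X_t(\xi) - X_{t-1}(\xi)$ that both kinds satisfy (a quick sketch; the function names are ours):
</p>

```python
import numpy as np

def cheb_T(t, x):
    """T_t(x) via T_0 = 1, T_1 = x, T_{k+1} = 2 x T_k - T_{k-1}."""
    a, b = np.ones_like(x), x
    for _ in range(t):
        a, b = b, 2 * x * b - a
    return a

def cheb_U(t, x):
    """U_t(x) via U_0 = 1, U_1 = 2x and the same recurrence."""
    a, b = np.ones_like(x), 2 * x
    for _ in range(t):
        a, b = b, 2 * x * b - a
    return a

theta = np.linspace(0.1, 3.0, 50)  # stay away from sin(theta) = 0
t = 7
err_T = np.max(np.abs(cheb_T(t, np.cos(theta)) - np.cos(t * theta)))
err_U = np.max(np.abs(cheb_U(t, np.cos(theta)) - np.sin((t + 1) * theta) / np.sin(theta)))
```

Both errors are at floating-point precision, confirming $T_t(\cos\theta) = \cos(t\theta)$ and $U_t(\cos\theta) = \sin((t+1)\theta)/\sin(\theta)$.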
<p>
<b>Outside</b> of this interval, the Chebyshev polynomials admit the <a href="https://en.wikipedia.org/wiki/Chebyshev_polynomials#Explicit_expressions">explicit expressions</a>, valid for $|\xi| \ge 1$:
\begin{align}
T_t(\xi) &= \dfrac{1}{2} \Big(\xi-\sqrt{\xi^2-1} \Big)^t + \dfrac{1}{2} \Big(\xi+\sqrt{\xi^2-1} \Big)^t \\
U_t(\xi) &= \frac{(\xi + \sqrt{\xi^2 - 1})^{t+1} - (\xi - \sqrt{\xi^2 - 1})^{t+1}}{2 \sqrt{\xi^2 - 1}}\,.
\end{align}
We're interested in convergence rates, so we'll look into the $t$-th root asymptotics of these quantities.<dt-note>With little extra effort, it would be possible to derive non-asymptotic convergence rates, although I won't pursue this analysis here.</dt-note> Luckily, these asymptotics are the same for both polynomials<dt-note>Although we won't use it here, this $t$-th root asymptotic holds for (almost) all orthogonal polynomials, not just Chebyshev polynomials. See for instance the reference below.</dt-note> <dt-cite key="stahl1990nth"></dt-cite> and taking limits we have that
\begin{equation}
\lim_{t \to \infty} \sqrt[t]{|T_t(\xi)|} = \lim_{t \to \infty} \sqrt[t]{|U_t(\xi)|} = |\xi| + \sqrt{\xi^2 - 1}\,.
\end{equation}
</p>
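<p>
We can observe this limit numerically from the explicit expressions above (the values here are purely illustrative):
</p>

```python
import numpy as np

xi, t = 1.5, 200
s = np.sqrt(xi**2 - 1)
# explicit expressions for T_t and U_t outside [-1, 1]
T_t = ((xi - s)**t + (xi + s)**t) / 2
U_t = ((xi + s)**(t + 1) - (xi - s)**(t + 1)) / (2 * s)
limit = xi + s                 # predicted limit of the t-th roots
root_T = T_t ** (1 / t)        # both approach `limit` as t grows
root_U = U_t ** (1 / t)
```

For $\xi = 1.5$ the limit is $\xi + \sqrt{\xi^2 - 1} \approx 2.618$, and already at $t = 200$ both $t$-th roots are within a couple of percent of it.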
<h2>The Robust Region</h2>
<p>
Let's start by considering the case in which the Chebyshev polynomials are evaluated only inside the $[-1, 1]$ interval. Since the Chebyshev polynomials are evaluated at $\sigma(\cdot)$, this amounts to requiring $|\sigma(\lambda)| \leq 1$ for all $\lambda \in [\lmin, L]$. We'll call the set of step-size and momentum parameters for which this inequality holds the <i>robust region</i>.
</p>
<p>
Let's visualize this region in a map.
Since $\sigma$ is a linear function, its extremal values over $[\lmin, L]$ are attained at the endpoints:
\begin{equation}
\max_{\lambda \in [\lmin, L]} |\sigma(\lambda)| = \max\{|\sigma(\lmin)|, |\sigma(L)|\}\,.
\end{equation}
Using $\lmin \leq L$ and that $\sigma(\lambda)$ is decreasing in $\lambda$, we can simplify the condition that the above is at most one to $\sigma(\lmin) \leq 1$ and $\sigma(L) \geq -1$, which in terms of the step-size and momentum correspond to:
\begin{equation}\label{eq:robust_region}
\frac{(1 - \sqrt{\mom})^2}{\lmin} \leq \step \leq \frac{(1 + \sqrt{\mom})^2}{L} \,.
\end{equation}
These two conditions provide the upper and lower bound of the robust region.
</p>
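<p>
In code, membership in the robust region is just these two bounds (a sketch; the function name is ours). A nice special case, anticipating the final section: for $\lmin = 1$, $L = 9$, Polyak's parameters land exactly where the two boundary curves meet.
</p>

```python
import numpy as np

def in_robust_region(h, m, mu, L):
    """(1 - sqrt(m))^2 / mu <= h <= (1 + sqrt(m))^2 / L."""
    return (1 - np.sqrt(m))**2 / mu <= h <= (1 + np.sqrt(m))**2 / L

mu, L = 1.0, 9.0
# Polyak's parameters: both bounds evaluate to exactly h = 0.25 here
m_polyak = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu)))**2  # = 0.25
h_polyak = (2 / (np.sqrt(L) + np.sqrt(mu)))**2                           # = 0.25
```
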
<figure>
<span class="marginnote">The robust region –in gray– is upper bounded by <span style="color: #377eb8">$h = \frac{(1 + \sqrt{m})^2}{L}$</span> and lower bounded by <span style="color: #e41a1c">$h = \frac{(1 - \sqrt{m})^2 }{\lmin}$</span>. </span>
<img src="/images/2021/sketch_robust_region.png" alt="">
</figure>
<h3>Asymptotic rate</h3>
<p>Let $\sigma(\lambda) = \cos(\theta)$ for some $\theta$, which is always possible since $\sigma(\lambda) \in [-1, 1]$. In this regime, Chebyshev polynomials verify the identities $T_t(\cos(\theta)) = \cos(t \theta)$ and $U_t(\cos(\theta)) = \sin((t+1)\theta)/\sin(\theta)$, which, substituted into the definition of the residual polynomial, give
\begin{equation}
P_t(\sigma^{-1}(\cos(\theta))) = \mom^{t/2} \left[ {\small\frac{2\mom}{1+\mom}}\, \cos(t\theta) + {\small\frac{1 - \mom}{1 + \mom}}\,\frac{\sin((t+1)\theta)}{\sin(\theta)}\right]\,.
\end{equation}
</p>
<p>
Since the expression inside the square brackets is bounded in absolute value by $t+2$, taking $t$-th root and then limits we have $\limsup_{t \to \infty} \sqrt[t]{|P_t(\sigma^{-1}(\cos(\theta)))|} = \sqrt{\mom}$ for <i>any</i> $\theta$. This gives our first asymptotic rate:
</p>
<p class="framed">
The asymptotic rate in the robust region is $\sqrt{\mom}$.
</p>
<p>
This is nothing short of magical. It would seem natural –and this will be the case in other regions– that the speed of convergence should depend on both the step-size and the momentum parameter. Yet, this result implies that it's not the case in the robust region. In this region, the convergence <i>only</i> depends on the momentum parameter $\mom$. Amazing.<dt-note>This
insensitivity to step-size has been leveraged by Zhang et al. 2018 to develop a momentum tuner </dt-note> <dt-cite key="zhang2017yellowfin"></dt-cite>
</p>
<p>
This also illustrates why we call this the <i>robust</i> region. In its interior, perturbing the step-size in a way that we stay within the region has no effect on the convergence rate. The next figure displays the asymptotic rate (darker is faster) in the robust region.
</p>
<figure>
<span class="marginnote">In the robust region, the asymptotic rate only depends on the momentum parameter.
<br><br><a href="https://colab.research.google.com/gist/fabianp/0085405720c2bbd3d2c4f72a96c3d054/momentum_chebyshev.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a>
</span>
<img src="/images/2021/rate_robust_region.png" alt="">
</figure>
<h2>The Lazy Region</h2>
<p>
Let's consider now what happens outside of the robust region. In this case, the convergence will depend on the largest of $\{|\sigma(\lmin)|, |\sigma(L)|\}$. We'll consider first the case in which the maximum is $|\sigma(\lmin)|$ and leave the other for the next section.
</p>
<p>
This region is determined by the inequalities $|\sigma(\lmin)| > 1$ and $|\sigma(\lmin)| \geq |\sigma(L)|$.
Using the definition of $\sigma$ and solving for $\step$ gives the equivalent conditions
\begin{equation}
\step \leq \frac{2(1 + \mom)}{L + \lmin} \quad \text{ and }\quad \step \leq \frac{(1 - \sqrt{\mom})^2}{\lmin}\,.
\end{equation}
Note the second inequality is the same one as for the robust region \eqref{eq:robust_region} but with the inequality sign reversed, and so the region will be on the opposite side of that curve. We'll call this the <i>lazy region</i>, as increasing the momentum will take us out of it and into the robust region.
</p>
<figure>
<span class="marginnote">
The lazy region, depicted in gray on the left: lower bounded by zero and upper bounded by the minimum of <span style="color: #e41a1c">$h = \frac{(1 - \sqrt{m})^2}{\lmin}$</span> and <span style="color: #4daf4a">$h = \frac{2(1 + m)}{L + \lmin}$</span>.
<br><br><a href="https://colab.research.google.com/gist/fabianp/0085405720c2bbd3d2c4f72a96c3d054/momentum_chebyshev.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a>
</span>
<img src="/images/2021/sketch_lazy_region.png" alt="">
</figure>
<h3>Asymptotic rate</h3>
<p>
As we saw in <a href="#sec2">Section 2</a>, outside of the $[-1, 1]$ interval both kinds of Chebyshev polynomials have simple $t$-th root asymptotics.
Using this, and the fact that both kinds agree in sign outside of the $[-1, 1]$ interval, we can compute the asymptotic rate as
\begin{align}
\lim_{t \to \infty} \sqrt[t]{r_t} &= \sqrt{\mom} \lim_{t \to \infty} \sqrt[t]{\Big|{\small\frac{2\mom}{\mom+1}}\, T_t(\sigma(\lmin)) + {\small\frac{1 - \mom}{1 + \mom}}\,U_t(\sigma(\lmin))\Big|} \\
&= \sqrt{\mom}\left(|\sigma(\lmin)| + \sqrt{\sigma(\lmin)^2 - 1} \right)\,.
\end{align}
This gives the asymptotic rate for this region:
</p>
<p class="framed">
In the lazy region the asymptotic rate is $\sqrt{\mom}\left(|\sigma(\lmin)| + \sqrt{\sigma(\lmin)^2 - 1} \right)$.
</p>
<p>
Unlike in the robust region, this rate depends on both the step-size and the momentum parameter, which enter the rate through the link function $\sigma$. This can be observed in the color plot of the asymptotic rate:
</p>
<figure>
<span class="marginnote">In the lazy region (unlike in the robust region), the asymptotic rate depends on both the step-size and momentum parameter.
<br><br><a href="https://colab.research.google.com/gist/fabianp/0085405720c2bbd3d2c4f72a96c3d054/momentum_chebyshev.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a>
</span>
<img src="/images/2021/rate_lazy_region.png" alt="">
</figure>
<h2>Edge of Stability</h2>
<p>
The robust and lazy regions occupy most (but not all!) of the parameter space for which momentum converges. There's a small region that sits between the lazy and robust regions and the region where momentum diverges. We call this region the <i>edge of stability</i>.
</p>
<p>
For parameters not in the robust or lazy region, we have that $|\sigma(L)| > 1$ and $|\sigma(L)| > |\sigma(\lmin)|$. Using the asymptotics of Chebyshev polynomials as we did in the previous section, we have that the asymptotic rate is $\sqrt{\mom}\left(|\sigma(L)| + \sqrt{\sigma(L)^2 - 1} \right)$. The method will only converge when this asymptotic rate is below 1. Enforcing this results in $\step \lt 2 (1 + \mom) / L$. Combining this condition with the one of not being in the robust or lazy region gives the characterization:
\begin{equation}
\step \lt \frac{2 (1 + \mom)}{L} \quad \text{ and } \quad \step \geq \max\Big\{\tfrac{2(1 + \mom)}{L + \lmin}, \tfrac{(1 + \sqrt{\mom})^2}{L}\Big\}\,.
\end{equation}
</p>
<figure>
<span class="marginnote">The edge of stability is upper bounded by <span style="color: #ff7f00">$h = \frac{2(1+m)}{L}$</span> and lower bounded by <span style="color: #984ea3">$h = \max\Big\{\tfrac{2(1 + \mom)}{L + \lmin}, \tfrac{(1 + \sqrt{\mom})^2}{L}\Big\}$</span>. <br><br><a href="https://colab.research.google.com/gist/fabianp/0085405720c2bbd3d2c4f72a96c3d054/momentum_chebyshev.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></span>
<img src="/images/2021/sketch_knife_edge.png" alt="">
</figure>
<h3>Asymptotic rate</h3>
<p>
The asymptotic rate can be computed using the same technique as in the lazy region. The resulting rate is the same as in that region but with $\sigma(L)$ replacing $\sigma(\lmin)$:
</p>
<p class="framed">
In the edge of stability region the asymptotic rate is $\sqrt{\mom}\left(|\sigma(L)| + \sqrt{\sigma(L)^2 - 1} \right)$.
</p>
<p>
Pictorially, this corresponds to
</p>
<figure>
<span class="marginnote">Convergence at the edge of stability. As in the lazy region, the convergence in this region depends on both the step-size and momentum parameter.
<br><br><a href="https://colab.research.google.com/gist/fabianp/0085405720c2bbd3d2c4f72a96c3d054/momentum_chebyshev.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a>
</span>
<img src="/images/2021/rate_knife_edge.png" alt="">
</figure>
<h2>Putting it all together</h2>
<p>
This is the end of our journey. We've visited all the regions on which momentum converges.<dt-note>There's a small convergent region with <i>negative</i> momentum parameter that we haven't visited. Although not typically used for minimization, negative momentum has found applications in smooth games <a href="https://arxiv.org/abs/1807.04740">(Gidel et al., 2020)</a>.</dt-note> The only thing left to do is to combine all the asymptotic rates we've gathered along the way.
</p>
<p class="theorem"> The asymptotic rate $\limsup_{t \to \infty} \sqrt[t]{r_t}$ of momentum is
\begin{alignat}{2}
&\sqrt{\mom} &&\text{ if }\step \in \big[\frac{(1 - \sqrt{\mom})^2}{\lmin}, \frac{(1+\sqrt{\mom})^2}{L}\big]\\
&\sqrt{\mom}(|\sigma(\lmin)| + \sqrt{\sigma(\lmin)^2 - 1}) &&\text{ if } \step \in \big[0, \min\{\tfrac{2(1 + \mom)}{L + \lmin}, \tfrac{(1 - \sqrt{\mom})^2}{\lmin}\}\big]\\
&\sqrt{\mom}(|\sigma(L)| + \sqrt{\sigma(L)^2 - 1})&&\text{ if } \step \in \big[\max\big\{\tfrac{2(1 + \mom)}{L + \lmin}, \tfrac{(1 + \sqrt{\mom})^2}{L}\big\}, \tfrac{2 (1 + \mom) }{L} \big)\\
&\geq 1 \text{ (divergence)} && \text{ otherwise.}
\end{alignat}
</p>
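<p>
The theorem translates directly into a small function (a sketch assuming $\mom \in (0, 1)$; the function name is ours). Note how two different step-sizes inside the robust region return the same rate $\sqrt{\mom}$:
</p>

```python
import numpy as np

def asymptotic_rate(h, m, mu, L):
    """Asymptotic rate of gradient descent with momentum on a quadratic with
    extreme eigenvalues mu <= L, following the case analysis of the theorem."""
    sm = np.sqrt(m)
    sigma = lambda lam: (1 + m - h * lam) / (2 * sm)  # link function
    if (1 - sm)**2 / mu <= h <= (1 + sm)**2 / L:      # robust region
        return sm
    if 0 <= h <= min(2 * (1 + m) / (L + mu), (1 - sm)**2 / mu):  # lazy region
        s = abs(sigma(mu))
        return sm * (s + np.sqrt(s**2 - 1))
    if h < 2 * (1 + m) / L:                           # edge of stability
        s = abs(sigma(L))
        return sm * (s + np.sqrt(s**2 - 1))
    return np.inf                                     # divergence

mu, L = 1.0, 9.0
```

For $\lmin = 1$, $L = 9$, Polyak's parameters $(\step, \mom) = (0.25, 0.25)$ give the rate $\sqrt{\mom} = 1/2$.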
<p>
Plotting the asymptotic rates for all regions, we can see that Polyak momentum (the method with momentum $\mom = \left(\frac{\sqrt{L} - \sqrt{\lmin}}{\sqrt{L} + \sqrt{\lmin}}\right)^2$ and step-size $\step = \left(\frac{2}{\sqrt{L} + \sqrt{\lmin}}\right)^2$, which is asymptotically optimal among momentum methods with constant coefficients) sits at the intersection of the three regions.
</p>
<figure class="fullwidth">
<img style="max-width: 1000px;" src="/images/2021/rate_convergence_momentum.png" alt="">
</figure>
<p>
<span class="marginnote">
<a href="https://colab.research.google.com/gist/fabianp/0085405720c2bbd3d2c4f72a96c3d054/momentum_chebyshev.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a>
</span>
</p>
<h2>Citing</h2>
<p>
If you find this blog post useful, please consider citing it as:
</p>
<p style="margin-left: 5%">
<a href="http://fa.bianp.net/blog/2021/hitchhiker/">A Hitchhiker's Guide to Momentum</a>, Fabian Pedregosa, 2021
</p>
<p>
Bibtex entry:
</p>
<pre>
<code>
@misc{pedregosa2021residual,
title={A Hitchhiker's Guide to Momentum},
author={Pedregosa, Fabian},
howpublished = {\url{http://fa.bianp.net/blog/2021/hitchhiker/}},
year={2021}
}
</code>
</pre>
<p>
<b>Acknowledgements.</b>
Thanks to <a href="https://scholar.google.com/citations?user=93PAG2AAAAAJ&hl=en">Baptiste Goujaud</a> for proofreading, catching many typos and making excellent clarifying suggestions, to <a href="https://gowerrobert.github.io/">Robert Gower</a> for proofreading this post and providing thoughtful feedback. Thanks also to <a href="https://cypaquette.github.io/">Courtney Paquette</a> and <a href="https://www.di.ens.fr/~ataylor/">Adrien Taylor</a> for reporting errors.
</p>
<hr />
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
On the Link Between Optimization and Polynomials, Part 22020-12-21T00:00:00+01:002020-12-21T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2020-12-21:/blog/2020/momentum/
<p>
We can tighten the analysis of gradient descent with momentum through a combination of Chebyshev polynomials of the first and second kind. Following this connection, we'll derive one of the most iconic methods in optimization: Polyak momentum.
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath …</script>
<p>
We can tighten the analysis of gradient descent with momentum through a combination of Chebyshev polynomials of the first and second kind. Following this connection, we'll derive one of the most iconic methods in optimization: Polyak momentum.
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js", "color.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@article{Rutishauser1959,
author="Rutishauser, H.",
title="Theory of Gradient Methods",
journal="Refined Iterative Methods for Computation of the Solution and the Eigenvalues of Self-Adjoint Boundary Value Problems",
year="1959",
url="https://doi.org/10.1007/978-3-0348-7224-9_2"
}
@article{flanders1950numerical,
title={Numerical determination of fundamental modes},
author={Flanders, Donald A and Shortley, George},
journal={Journal of Applied Physics},
year={1950},
url={https://doi.org/10.1063/1.1699598},
}
@book{fischer1996polynomial,
title={Polynomial based iteration methods for symmetric linear systems},
author={Fischer, Bernd},
year={1996},
url={https://doi.org/10.1007/978-3-663-11108-5},
publisher={Springer}
}
@article{hestenes1952methods,
title={Methods of conjugate gradients for solving linear systems},
author={Hestenes, Magnus and Stiefel, Eduard},
journal={Journal of research of the National Bureau of Standards},
year={1952},
url={https://pdfs.semanticscholar.org/466d/addfb6340c28cb8da548007028c8cc5df687.pdf}
}
@article{pedregosa2020average,
title={Average-case Acceleration Through Spectral Density Estimation},
author={Pedregosa, Fabian and Scieur, Damien},
journal={arXiv preprint arXiv:2002.04756},
year={2020},
url={https://arxiv.org/pdf/2002.04756.pdf}
}
@article{scieur2020universal,
title={Universal Average-Case Optimality of Polyak Momentum},
author={Scieur, Damien and Pedregosa, Fabian},
journal={arXiv preprint arXiv:2002.04664},
year={2020},
url={https://arxiv.org/pdf/2002.04664.pdf}
}
@article{polyak1964some,
title={Some methods of speeding up the convergence of iteration methods},
author={Polyak, Boris},
journal={USSR Computational Mathematics and Mathematical Physics},
year={1964},
url={https://doi.org/10.1016/0041-5553(64)90137-5}
}
@article{polyak1987introduction,
title={Introduction to Optimization},
author={Polyak, Boris},
journal={Optimization Software, Inc. Publications Division, New York},
url={https://b-ok.cc/book/2461679/c8b7e4},
year={1987}
}
@article{frankel1950convergence,
title={Convergence rates of iterative treatments of partial differential equations},
author={Frankel, Stanley},
journal={Mathematical Tables and Other Aids to Computation},
year={1950},
url={https://www.jstor.org/stable/2002770},
publisher={JSTOR}
}
@inproceedings{sutskever2013importance,
title={On the importance of initialization and momentum in deep learning},
author={Sutskever, Ilya and Martens, James and Dahl, George and Hinton, Geoffrey},
journal={International conference on machine learning},
year={2013},
url={http://proceedings.mlr.press/v28/sutskever13.pdf}
}
@article{zhang20202dive,
title={Dive into Deep Learning},
author={Zhang, Aston and Lipton, Zachary C and Li, Mu and Smola, Alexander J},
journal={https://d2l.ai},
year={2020},
url={https://d2l.ai/chapter_optimization/momentum.html}
}
@book{elaydi2005introduction,
title={An introduction to difference equations},
author={Elaydi, Saber},
year={2005},
publisher={Springer}
}
@article{perron1921summengleichungen,
title={{\"U}ber summengleichungen und Poincar{\'e}sche differenzengleichungen},
author={Perron, Oskar},
journal={Mathematische Annalen},
year={1921},
publisher={Springer}
}
@article{hochstrasser1954anwendung,
title={Die Anwendung der Methode der konjugierten Gradienten und ihrer Modifikationen auf die Lösung linearer Randwertprobleme},
author={Hochstrasser, Urs},
year={1954},
journal={Doctoral Thesis (in German), ETH Zurich},
url={https://doi.org/10.3929/ethz-a-000091966}
}
@inproceedings{ghadimi2015global,
title={Global convergence of the heavy-ball method for convex optimization},
author={Ghadimi, Euhanna and Feyzmahdavian, Hamid Reza and Johansson, Mikael},
booktitle={2015 European Control Conference (ECC)},
year={2015},
url={https://arxiv.org/abs/1412.7457},
}
@inproceedings{flammarion2015averaging,
title={From averaging to acceleration, there is only a step-size},
author={Flammarion, Nicolas and Bach, Francis},
booktitle={Conference on Learning Theory},
year={2015}
}
@book{khrushchev2008orthogonal,
title={Orthogonal polynomials and continued fractions: from Euler's point of view},
author={Khrushchev, Sergey},
year={2008},
url={https://www.maths.ed.ac.uk/~v1ranick/papers/khrushchev.pdf},
publisher={Cambridge University Press}
}
@article{totik2005orthogonal,
title={Orthogonal polynomials},
author={Totik, Vilmos},
journal={arXiv preprint math/0512424},
year={2005},
url={https://arxiv.org/pdf/math/0512424.pdf}
}
@book{suli2003introduction,
title={An introduction to numerical analysis},
author={Süli, Endre and Mayers, David},
year={2003},
publisher={Cambridge University Press},
url={https://www.cambridge.org/core/books/an-introduction-to-numerical-analysis/FD8BCAD7FE68002E2179DFF68B8B7237}
}
@book{deift1999orthogonal,
title={Orthogonal polynomials and random matrices: a Riemann-Hilbert approach},
author={Deift, Percy},
volume={3},
year={1999},
publisher={American Mathematical Soc.},
url={https://books.google.ca/books?id=SBR8yv0LkFgC&lpg=PP1&pg=PP1#v=onepage&q&f=false}
}
@article{legendre1782,
title={Recherches sur l'attraction des sphéroïdes homogènes},
author={Legendre, Adrien-Marie},
year={1782},
url={https://web.archive.org/web/20090920070434/http://edocs.ub.uni-frankfurt.de/volltexte/2007/3757/pdf/A009566090.pdf},
journal={Mémoires de Mathématiques et de Physique, présentés à l'Académie Royale des Sciences, par divers savans, et lus dans ses Assemblées}
}
@book{szeg1975orthogonal,
title={Orthogonal polynomials},
author={Szegő, Gábor},
year={1975},
publisher={American Mathematical Soc.},
url={https://people.math.osu.edu/nevai.1/SZEGO/szego=szego1975=ops=OCR.pdf}
}
@book{gautschi2004orthogonal,
title={Orthogonal polynomials},
author={Gautschi, Walter},
year={2004},
publisher={Oxford University Press},
url={https://global.oup.com/academic/product/orthogonal-polynomials-9780198506720?cc=ca&lang=en&}
}
@book{chihara2011introduction,
title={An introduction to orthogonal polynomials},
author={Chihara, Theodore S},
year={2011},
publisher={Courier Corporation},
url={https://books.google.ca/books?id=71CVAwAAQBAJ&lpg=PP1&pg=PP1#v=onepage&q&f=false}
}
@article{marcellan2001favard,
title={On the “Favard theorem” and its extensions},
author={Marcellán, Francisco and Álvarez-Nodarse, Renato},
journal={Journal of computational and applied mathematics},
year={2001},
url={http://merlin.us.es/~renato/papers/fav-jcam.pdf},
publisher={Elsevier}
}
@article{favard1935,
author = {Jean Favard},
title = {Sur les polynomes de Tchebicheff},
journal = {Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences, Paris},
url={https://gallica.bnf.fr/ark:/12148/bpt6k3152t/f2052.item},
volume = {200},
year = {1935},
publisher = {Gauthier-Villars, Paris},
}
@article{cretney2014origins,
title={The origins of Euler's early work on continued fractions},
author={Cretney, Rosanna},
journal={Historia Mathematica},
url={https://doi.org/10.1016/j.hm.2013.12.004},
pages={139--156},
year={2014},
publisher={Elsevier}
}
@book{nevai1979orthogonal,
title={Orthogonal polynomials},
author={Nevai, Paul G},
volume={213},
year={1979},
publisher={American Mathematical Soc.},
url={https://books.google.ca/books?id=hxTUCQAAQBAJ&lpg=PP1&pg=PP1#v=onepage&q&f=false}
}
@book{mhaskar1997introduction,
title={Introduction to the theory of weighted polynomial approximation},
author={Mhaskar, Hrushikesh Narhar},
volume={7},
year={1997},
publisher={World Scientific}
}
@article{aptekarev1994spatial,
title={Spatial entropy of central potentials and strong asymptotics of orthogonal polynomials},
author={Aptekarev, AI and Dehesa, JS and Yáñez, RJ},
journal={Journal of Mathematical Physics},
volume={35},
number={9},
pages={4423--4428},
year={1994},
publisher={American Institute of Physics},
url={https://aip.scitation.org/doi/10.1063/1.530861}
}
@book{grenander1958toeplitz,
title={Toeplitz forms and their applications},
author={Grenander, Ulf and Szegő, Gábor},
year={1958},
publisher={Univ of California Press}
}
@article{van1991orthogonal,
title={Orthogonal polynomials, associated polynomials and functions of the second kind},
author={Van Assche, Walter},
journal={Journal of computational and applied mathematics},
year={1991},
publisher={Elsevier},
url={https://doi.org/10.1016/0377-0427(91)90121-Y}
}
@article{zhang2017yellowfin,
title={Yellowfin and the art of momentum tuning},
author={Zhang, Jian and Mitliagkas, Ioannis},
journal={SysML},
year={2018},
url={https://arxiv.org/pdf/1706.03471.pdf}
}
@article{paquette2020halting,
title={Halting Time is Predictable for Large Models: A Universality Property and Average-case Analysis},
author={Paquette, Courtney and van Merriënboer, Bart and Paquette, Elliot and Pedregosa, Fabian},
journal={Foundations of Computational Mathematics},
url={https://arxiv.org/pdf/2006.04299.pdf},
year={2021}
}
@article{goh2017momentum,
title={Why momentum really works},
author={Goh, Gabriel},
journal={Distill},
year={2017},
url={https://distill.pub/2017/momentum/}
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
document.addEventListener('DOMContentLoaded', doTOC, false);
</script>
<div id="TOC"></div>
<div style="display: none">
$$
\def\aa{\boldsymbol a}
\def\rr{\boldsymbol r}
\def\AA{\boldsymbol A}
\def\HH{\boldsymbol H}
\def\EE{\mathbb E}
\def\II{\boldsymbol I}
\def\CC{\boldsymbol C}
\def\DD{\boldsymbol D}
\def\KK{\boldsymbol K}
\def\eeps{\boldsymbol \varepsilon}
\def\tr{\text{tr}}
\def\LLambda{\boldsymbol \Lambda}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\qq{\boldsymbol q}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\lmax{L}
\def\lmin{\mu}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\QQ{\boldsymbol Q}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\DeclareMathOperator{\span}{\mathbf{span}}
\def\defas{\stackrel{\text{def}}{=}}
\def\dif{\mathop{}\!\mathrm{d}}
\definecolor{colormomentum}{RGB}{27, 158, 119}
\definecolor{colorstepsize}{RGB}{217, 95, 2}
$$
</div>
<h2>Gradient Descent with Momentum</h2>
<p>
Gradient descent with momentum, also known as heavy ball or momentum for short, is an optimization
method
designed to solve unconstrained optimization problems of the form
\begin{equation}
\argmin_{\xx \in \RR^d} f(\xx)\,,
\end{equation}
where the objective function $f$ is differentiable and we have access to its gradient $\nabla
f$. In this method
the update is a sum of two terms. The first term is the
difference between the current and previous iterate $(\xx_{t} - \xx_{t-1})$, also known as the <i>momentum
term</i>. The second term is the gradient $\nabla f(\xx_t)$ of the objective function.
</p>
<p class="framed">
<b class="tufte-underline">Gradient Descent with Momentum</b><br>
<b>Input</b>: starting guess $\xx_0$, step-size ${\color{colorstepsize} h} > 0$ and momentum
parameter ${\color{colormomentum} m} \in (0, 1)$.<br>
$\xx_1 = \xx_0 - \dfrac{{\color{colorstepsize} h}}{1 + {\color{colormomentum} m}} \nabla f(\xx_0)$ <br>
<b>For</b> $t=1, 2, \ldots$ compute
\begin{equation}\label{eq:momentum_update}
\xx_{t+1} = \xx_t + {\color{colormomentum} m}(\xx_{t} - \xx_{t-1}) - {\color{colorstepsize}
h}\nabla
f(\xx_t)
\end{equation}
</p>
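<p>
The update \eqref{eq:momentum_update} takes only a few lines of NumPy. Below is a minimal sketch; the quadratic test function and the parameter values are illustrative choices, not part of the method:
</p>

```python
import numpy as np

def momentum(grad, x0, h, m, n_iter=100):
    """Gradient descent with momentum (heavy ball)."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev - h / (1 + m) * grad(x_prev)  # special first step
    for _ in range(n_iter):
        # x_{t+1} = x_t + m (x_t - x_{t-1}) - h grad(x_t)
        x, x_prev = x + m * (x - x_prev) - h * grad(x), x
    return x

# illustrative quadratic f(x) = 0.5 x^T H x + b^T x, minimizer x* = -H^{-1} b
H = np.diag([0.1, 1.0])
b = np.array([1.0, -1.0])
x_opt = -np.linalg.solve(H, b)
x = momentum(lambda z: H @ z + b, np.zeros(2), h=1.0, m=0.5, n_iter=300)
```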
<p>
Gradient descent with momentum has seen a renewed interest in recent years, as a stochastic variant that replaces the true gradient with a stochastic estimate has proven to be very effective for deep learning.<dt-cite key="sutskever2013importance"></dt-cite> <dt-cite key="zhang20202dive"></dt-cite>
</p>
<h3>Quadratic model</h3>
<p>
Although momentum can be applied to any problem with a twice differentiable objective, in this post we'll assume for simplicity that the objective function $f$ is quadratic.<dt-note>The extension to non-quadratic functions follows from simple local arguments and can be found for example in <a href="https://www.researchgate.net/publication/342978480_Introduction_to_Optimization">Polyak's book</a></dt-note> The quadratic assumption is restrictive, but well worth it, as it allows a richer theory and a simplified analysis through the connection between optimization methods and polynomials.
As in the <a href="/blog/2020/polyopt/">previous post</a>, we'll assume the objective is
\begin{equation}\label{eq:opt}
f(\xx) \defas \frac{1}{2}\xx^\top \HH \xx + \bb^\top \xx~,
\end{equation}
where $\HH$ is a positive definite square matrix with eigenvalues in the interval $[\lmin, L]$.
</p>
<p>
<b>Example: Polyak Momentum</b>, also known as the HeavyBall method, is a widely used instance of momentum. Originally developed for the solution of linear equations and known as Frankel's method,<dt-cite key="frankel1950convergence"></dt-cite>
<dt-cite key="hochstrasser1954anwendung"></dt-cite>
<dt-cite key="Rutishauser1959"></dt-cite> it was generalized to general objectives and popularized in the optimization community by Boris Polyak.<dt-note> <img src="/images/2020/polyak.jpeg" alt="Boris Polyak" style="display: block; margin: 0 auto; max-width: 160px; box-shadow: 6px 6px 3px grey;"> <br> <a href="http://lab7.ipu.ru/eng/people/polyak.html">Boris Polyak</a> (1935 —) is a Russian mathematician and pioneer of optimization. Currently, he leads the Laboratory at Institute of Control Science in Moscow. <br>
</dt-note> <dt-cite key="polyak1964some">
</dt-cite>
<dt-cite key="polyak1987introduction"></dt-cite>
</p>
<p>Polyak momentum follows the update \eqref{eq:momentum_update} with momentum and step-size parameters
\begin{equation}\label{eq:polyak_parameters}
{\color{colormomentum}m = {\Big(\frac{\sqrt{L}-\sqrt{\lmin}}{\sqrt{L}+\sqrt{\lmin}}\Big)^2}}
\text{ and }
{\color{colorstepsize}h = \Big(\frac{ 2}{\sqrt{L}+\sqrt{\lmin}}\Big)^2}\,,
\end{equation}
which gives the full algorithm
</p>
<p class="framed">
<b class="tufte-underline">Polyak Momentum</b><br>
<b>Input</b>: starting guess $\xx_0$, lower and upper eigenvalue bound $\lmin$ and $L$.<br>
$\xx_1 = \xx_0 - {\color{colorstepsize}\frac{2}{L + \lmin}} \nabla f(\xx_0)$ <br>
<b>For</b> $t=1, 2, \ldots$ compute
\begin{equation}\label{eq:polyak_momentum_update}
\xx_{t+1} = \xx_t + {\color{colormomentum} {\Big(\tfrac{\sqrt{L}-\sqrt{\lmin}}{\sqrt{L}+\sqrt{\lmin}}\Big)^2}}(\xx_{t} - \xx_{t-1}) - {\color{colorstepsize} \Big(\tfrac{ 2}{\sqrt{L}+\sqrt{\lmin}}\Big)^2}\nabla
f(\xx_t)
\end{equation}
</p>
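<p>
As a sketch, the parameters \eqref{eq:polyak_parameters} can be computed directly from the eigenvalue bounds and plugged into the momentum update; the test problem below is an illustrative choice:
</p>

```python
import numpy as np

def polyak_momentum(grad, x0, lmin, lmax, n_iter=100):
    """Momentum with the Polyak parameters computed from eigenvalue bounds."""
    m = ((np.sqrt(lmax) - np.sqrt(lmin)) / (np.sqrt(lmax) + np.sqrt(lmin))) ** 2
    h = (2 / (np.sqrt(lmax) + np.sqrt(lmin))) ** 2
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev - 2 / (lmax + lmin) * grad(x_prev)  # h / (1 + m) = 2 / (L + lmin)
    for _ in range(n_iter):
        x, x_prev = x + m * (x - x_prev) - h * grad(x), x
    return x

# illustrative quadratic with eigenvalues in [0.1, 1]
H = np.diag([0.1, 0.5, 1.0])
b = np.array([1.0, 0.0, -1.0])
x = polyak_momentum(lambda z: H @ z + b, np.zeros(3), lmin=0.1, lmax=1.0, n_iter=200)
```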
<h3>Residual Polynomials</h3>
<p>
In the <a href="/blog/2020/polyopt/">last post</a> we saw that to each gradient-based optimizer we can associate a sequence of
polynomials $P_1, P_2,
...$ that we named <i>residual polynomials</i> and that determines the convergence of the method:
</p>
<p class="lemma" text="Hestenes and Stiefel, 1952">
<dt-cite key="hestenes1952methods"></dt-cite>
Let $\xx_1, \xx_2, \ldots$ be the iterates generated by a momentum method and $P_1, P_2, \ldots$
the sequence of
residual polynomials associated with this method. Then the error at each iteration $t$ verifies
\begin{equation}
\xx_t - \xx^\star = P_t(\HH)(\xx_0 - \xx^\star)\,.
\end{equation}
</p>
<p class="proof">
This is a special case of the Residual polynomial Lemma proved <a href="/blog/2020/polyopt/#lemma_residual_polynomial">in the last post</a>.
</p>
<p>
In the <a href="/blog/2020/polyopt/#lemma_residual_polynomial">last post</a> we also gave an expression for this residual polynomial. Although this expression is rather involved in the general case, it simplifies for momentum, where it follows the simple recurrence:
</p>
<p class="lemma" text="Momentum Residual Polynomials" id="momentumresidual">
Given a momentum method with parameters
${\color{colormomentum} m}$, ${\color{colorstepsize}
h}$, the associated residual polynomials $P_0, P_1, \ldots$ verify the following recursion
\begin{equation}
\begin{split}
&P_{t+1}(\lambda) = (1 + {\color{colormomentum}m} - {\color{colorstepsize} h} \lambda ) P_{t}(\lambda) -
{\color{colormomentum}m} P_{t-1}(\lambda)\\
&P_0(\lambda) = 1\,,~ P_1(\lambda) = 1 - \frac{{\color{colorstepsize} h}}{1 + {\color{colormomentum}m}} \lambda \,.
\end{split}\label{eq:def_residual_polynomial2}
\end{equation}
</p>
<p>
A consequence of this result is that not only the method, but also its residual polynomial, has a simple recurrence. We will soon leverage this to simplify the analysis.
</p>
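<p>
We can check the lemma numerically: running the momentum recursion on a quadratic and, in parallel, the polynomial recursion \eqref{eq:def_residual_polynomial2} applied to the matrix $\HH$, the identity $\xx_t - \xx^\star = P_t(\HH)(\xx_0 - \xx^\star)$ should hold to floating point precision. A sketch with arbitrary parameter values:
</p>

```python
import numpy as np

h, m, T = 0.8, 0.4, 20
H = np.diag([0.2, 0.7, 1.0])
b = np.array([1.0, -2.0, 0.5])
x_star = -np.linalg.solve(H, b)
x0 = np.zeros(3)
grad = lambda x: H @ x + b

# momentum iterates x_0, x_1, ..., x_T
xs = [x0, x0 - h / (1 + m) * grad(x0)]
for _ in range(T - 1):
    xs.append(xs[-1] + m * (xs[-1] - xs[-2]) - h * grad(xs[-1]))

# residual polynomials evaluated at the matrix H
I = np.eye(3)
P = [I, I - h / (1 + m) * H]
for _ in range(T - 1):
    P.append(((1 + m) * I - h * H) @ P[-1] - m * P[-2])

# x_t - x* should equal P_t(H) (x_0 - x*) for every t
errors = [np.linalg.norm(x - x_star - Pt @ (x0 - x_star)) for x, Pt in zip(xs, P)]
```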
<h2>Chebyshev meets Chebyshev</h2>
<p>
<a href="https://en.wikipedia.org/wiki/Chebyshev_polynomials">Chebyshev polynomials</a> of the first kind appeared in the last post when deriving the optimal worst-case method. These are polynomials
$T_0, T_1, \ldots$ defined by the recurrence relation
\begin{align}
&T_0(\lambda) = 1 \qquad T_1(\lambda) = \lambda\\
&T_{k+1}(\lambda) = 2 \lambda T_k(\lambda) - T_{k-1}(\lambda)~.
\end{align}
A fact that we didn't use in the last post is that Chebyshev polynomials are orthogonal with respect to the weight function
\begin{equation}
\dif\omega(\lambda) = \begin{cases} \frac{1}{\pi\sqrt{1 - \lambda^2}} &\text{ if $\lambda \in
[-1, 1]$} \\ 0 &\text{ otherwise}\,. \end{cases}
\end{equation}
That is, they verify the orthogonality condition
\begin{equation}
\int T_i(\lambda) T_j(\lambda) \dif\omega(\lambda) \begin{cases} > 0 \text{ if $i = j$}\\ = 0
\text{ otherwise}\,.\end{cases}
\end{equation}
</p>
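<p>
As a sanity check, the three-term recurrence above can be compared against NumPy's built-in Chebyshev module (a sketch; <code>chebval</code> with a one-hot coefficient vector evaluates a single $T_t$):
</p>

```python
import numpy as np
from numpy.polynomial import chebyshev

def T(t, x):
    """Chebyshev polynomial of the first kind via the three-term recurrence."""
    T_prev, T_cur = np.ones_like(x), np.asarray(x, dtype=float)  # T_0, T_1
    if t == 0:
        return T_prev
    for _ in range(t - 1):
        T_prev, T_cur = T_cur, 2 * x * T_cur - T_prev
    return T_cur

xs = np.linspace(-1, 1, 101)
# chebval(xs, [0]*t + [1]) evaluates T_t at every point of xs
agrees = all(
    np.allclose(T(t, xs), chebyshev.chebval(xs, [0] * t + [1])) for t in range(6)
)
```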
<p>
I mentioned that these are the Chebyshev polynomials of the first kind because there are
also Chebyshev polynomials <i>of the second kind</i>, and, although less used, of the third
kind, fourth kind, and so on.
</p>
<p>
Chebyshev polynomials of the second kind, denoted $U_0, U_1, \ldots$, are defined by the
integral
\begin{equation}
U_t(\lambda) = \int \frac{T_{t+1}(\xi) - T_{t+1}(\lambda)}{\xi- \lambda}\dif\omega(\xi)\,.
\end{equation}
</p>
<p>
Although we will only use this construction for Chebyshev polynomials, it extends beyond this
setup and is very natural because these polynomials are the numerators for the convergents of
certain continued fractions.<dt-cite key="van1991orthogonal"></dt-cite>
</p>
<p>
The polynomials of the second kind verify the same recurrence formula as the original sequence,
but with different initial conditions. In particular, the Chebyshev polynomials of the second
kind are determined by the recurrence
\begin{align}
&U_0(\lambda) = 1 \qquad U_1(\lambda) = 2 \lambda\\
&U_{k+1}(\lambda) = 2 \lambda U_k(\lambda) - U_{k-1}(\lambda)~.
\end{align}
Note how $U_1(\lambda)$ differs from $T_1$, but other than that, the coefficients in the
recurrence are the same. Because of this, both sequences behave somewhat similarly, but also have important differences. For example, both kinds have extrema at the endpoints, but the values differ: for the first kind the
extrema are $\pm 1$, while for the second kind they are $\pm (t+1)$.
</p>
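<p>
Since both kinds share the same recurrence coefficients, they can be evaluated in a single loop. The sketch below checks the endpoint values mentioned above, $T_t(1) = 1$ and $U_t(1) = t+1$, and that neither polynomial exceeds its endpoint value inside $[-1, 1]$:
</p>

```python
import numpy as np

def cheb_pair(t, x):
    """Evaluate (T_t(x), U_t(x)) with their shared three-term recurrence."""
    x = np.asarray(x, dtype=float)
    T_prev, T_cur = np.ones_like(x), x        # T_0, T_1
    U_prev, U_cur = np.ones_like(x), 2 * x    # U_0, U_1
    if t == 0:
        return T_prev, U_prev
    for _ in range(t - 1):
        T_prev, T_cur = T_cur, 2 * x * T_cur - T_prev
        U_prev, U_cur = U_cur, 2 * x * U_cur - U_prev
    return T_cur, U_cur

# endpoint values: T_t(1) = 1 and U_t(1) = t + 1
T5, U5 = cheb_pair(5, np.array([1.0]))
# inside [-1, 1] the polynomials stay within their extremal values
xs = np.linspace(-1, 1, 1001)
in_bounds = all(
    (np.abs(cheb_pair(t, xs)[0]) <= 1 + 1e-9).all()
    and (np.abs(cheb_pair(t, xs)[1]) <= t + 1 + 1e-9).all()
    for t in range(8)
)
```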
<figure>
<span class="marginnote"><br>Image of the first 6 Chebyshev polynomials of the first and second
kind in the $[-1, 1]$ interval. <br><br>
<!-- Both kinds exhibit an oscillatory behavior that can in fact be charterized using trigonometric functions. The first kind satisfies $T_t(\cos(\theta)) = \cos(t \theta)$, while for the second kind verifies $U_t(\cos(\theta)) \sin(\theta) = \sin((t+1)\theta)$.
<br><br> -->
<a href="https://colab.research.google.com/gist/fabianp/f33a3a8fd933444c02888b531aee85b8/polynomials_acceleration.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</span>
<img src="/images/2020/chebyshev.svg" alt="Chebyshev polynomials">
<img src="/images/2020/chebyshev2.svg" alt="Chebyshev polynomials">
</figure>
<h2>Momentum's residual polynomial</h2>
<p>
Time has finally come for the main result: the residual polynomial of
gradient descent with momentum is nothing else than a combination of Chebyshev polynomials of the first and second
kind. Since the properties of Chebyshev polynomials are well known, this will provide a way to derive convergence bounds for momentum.
</p>
<p class="theorem framed">
Consider the momentum method with step-size ${\color{colorstepsize}h}$ and momentum parameter ${\color{colormomentum}m}$.
The residual polynomial $P_t$ associated with this method
can be written in terms of Chebyshev
polynomials as
\begin{equation}\label{eq:theorem}
P_t(\lambda) = {\color{colormomentum}m}^{t/2} \left( {\small\frac{2
{\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\,
T_t(\sigma(\lambda))
+ {\small\frac{1 - {\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\,
U_t(\sigma(\lambda))\right)\,,
\end{equation}
with $\sigma(\lambda) = {\small\dfrac{1}{2\sqrt{{\color{colormomentum}m}}}}(1 +
{\color{colormomentum}m} -
{\color{colorstepsize} h}\,\lambda)\,$.
</p>
<div class="proof">
<p>
Let's denote by $\widetilde{P}_t$ the right hand side of the above equation, that is,
\begin{equation}
\widetilde{P}_{t}(\lambda) \defas {\color{colormomentum}m}^{t/2} \left( {\small\frac{2
{\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\,
T_t(\sigma(\lambda))
+ {\small\frac{1 - {\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\,
U_t(\sigma(\lambda))\right)\,.
\end{equation}
Our goal is to show that $P_t = \widetilde{P}_t$ for all $t$.
</p>
<p>
For $t=1$, $T_1(\lambda) = \lambda$ and $U_1(\lambda) = 2\lambda$, so we have
\begin{align}
\widetilde{P}_1(\lambda) &= \sqrt{{\color{colormomentum}m}} \left(\tfrac{2
{\color{colormomentum}m}}{1 + {\color{colormomentum}m}} \sigma(\lambda) + \tfrac{1 - {\color{colormomentum}m}}{1 + {\color{colormomentum}m}} 2
\sigma(\lambda)\right)\\
&= \frac{2 \sqrt{{\color{colormomentum}m}}}{1 + {\color{colormomentum}m}} \sigma(\lambda) = 1 - \frac{{\color{colorstepsize}h}}{1 + {\color{colormomentum}m}} \lambda\,,
\end{align}
which corresponds to the definition of $P_1$ in \eqref{eq:def_residual_polynomial2}.
</p>
<p>
Assume it's true for any iteration up to $t$, we will show it's true for $t+1$. Using the three-term recurrence of Chebyshev polynomials we have
\begin{align}
&\widetilde{P}_{t+1}(\lambda) = {\color{colormomentum}m}^{(t+1)/2} \left( {\small\frac{2 {\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\,
T_{t+1}(\sigma(\lambda))
+ {\small\frac{1 - {\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\, U_{t+1}(\sigma(\lambda))\right) \\
&= {\color{colormomentum}m}^{(t+1)/2} \Big( {\small\frac{2
{\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\,
(2 \sigma(\lambda) T_{t}(\sigma(\lambda)) - T_{t-1}(\sigma(\lambda))) \nonumber\\
&\qquad\qquad
+ {\small\frac{1 - {\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\, (2 \sigma(\lambda)
U_{t}(\sigma(\lambda)) - U_{t-1}(\sigma(\lambda)))\Big)\\
&= 2 \sigma(\lambda) \sqrt{{\color{colormomentum}m}} P_t(\lambda) - {\color{colormomentum}m} P_{t-1}(\lambda)\\
&= (1 + {\color{colormomentum}m} - {\color{colorstepsize}h} \lambda) P_t(\lambda) -
{\color{colormomentum}m} P_{t-1}(\lambda)
\end{align}
where the third identity follows from grouping polynomials of the same degree and the
induction hypothesis. The last expression is the recursive definition of $P_{t+1}$ in
\eqref{eq:def_residual_polynomial2}, which proves the desired $\widetilde{P}_{t+1} =
{P}_{t+1}$.
</p>
</div>
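<p>
The theorem can also be verified numerically: evaluating the recursion \eqref{eq:def_residual_polynomial2} and the Chebyshev combination \eqref{eq:theorem} on a grid of $\lambda$ values, the two should agree to floating point precision. A sketch with arbitrary parameter values:
</p>

```python
import numpy as np

h, m = 0.9, 0.3
lams = np.linspace(0.0, 2.0, 200)
s = (1 + m - h * lams) / (2 * np.sqrt(m))  # sigma(lambda)

# P_t via the momentum recurrence
P_prev, P_cur = np.ones_like(lams), 1 - h / (1 + m) * lams
# T_t and U_t via their recurrences, evaluated at sigma(lambda)
T_prev, T_cur = np.ones_like(s), s
U_prev, U_cur = np.ones_like(s), 2 * s

max_err = 0.0
for t in range(1, 30):
    rhs = m ** (t / 2) * (2 * m / (1 + m) * T_cur + (1 - m) / (1 + m) * U_cur)
    max_err = max(max_err, np.max(np.abs(P_cur - rhs)))
    P_prev, P_cur = P_cur, (1 + m - h * lams) * P_cur - m * P_prev
    T_prev, T_cur = T_cur, 2 * s * T_cur - T_prev
    U_prev, U_cur = U_cur, 2 * s * U_cur - U_prev
```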
<p>
<b>History bits.</b> The theorem above is a generalization<dt-note>If you have seen this result elsewhere, please leave a comment, I would be very interested to know.</dt-note> of an existing result for Polyak momentum used without proof by
Rutishauser in his 1959 book.<dt-cite key="Rutishauser1959"></dt-cite>
</p>
<figure>
<span class="marginnote"><b>Left figure</b>. Excerpt from Rutishauser's 1959 book, where he
identifies the residual polynomial corresponding to Frankel's method (equivalent to Polyak
momentum on quadratic functions).</span>
<img style="box-shadow: 6px 6px 3px grey;" src="/images/2020/rutishauser_chebyshevs.png" alt="">
</figure>
<p>
More recently, in a collaboration with <a href="https://cypaquette.github.io/">Courtney</a>, <a href="https://scholar.google.ca/citations?user=XE9SDzgAAAAJ&hl=en">Bart</a> and <a href="https://elliotpaquette.github.io/">Elliot</a> we used Rutishauser's expression to derive an <i>average-case</i> analysis of Polyak momentum.<dt-cite key="paquette2020halting"></dt-cite> In that work we also derive similar expressions for other accelerated methods like Nesterov acceleration.
</p>
<h2>Convergence Rates and Robust Region</h2>
<p>
In the last post we introduced the <a href="/blog/2020/polyopt/#convergence-rate">worst-case
convergence rate</a> as a tool to compare the speed of convergence of optimization algorithms.
These are an upper bound on the per-iteration progress of an optimization algorithm. In other
words, a small convergence rate guarantees a fast convergence. Convergence rates can be
expressed in terms of the residual polynomial as
\begin{equation}
r_t \defas \max_{\lambda \in [\lmin, L]} |P_t(\lambda)|\,,
\end{equation}
where $\lmin, L$ are the smallest and largest eigenvalues of $\HH$ respectively. Using Hestenes and Stiefel's lemma, we can see that $r_t$ is an upper bound on the norm of the error in the sense that it verifies
\begin{equation}
\|\xx_t - \xx^\star\| \leq r_t\,\|\xx_0 - \xx^\star\|\,.
\end{equation}
</p>
<p>
Deriving convergence rates is hard, but the previous theorem allows us to convert this problem into that of bounding Chebyshev polynomials. Luckily, the latter is easier, as we know a lot about Chebyshev polynomials. In particular, replacing $P_t$ with \eqref{eq:theorem} in the definition of the convergence rate and upper bounding the joint maximum by separate maxima on $T_t$ and $U_t$, we have
\begin{equation}
r_t \leq {\color{colormomentum}m}^{t/2} \left( {\small\frac{2
{\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\,
\max_{\lambda \in [\lmin, L]}|T_t(\sigma(\lambda))|
+ {\small\frac{1 - {\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\, \max_{\lambda \in
[\lmin, L]}|U_t(\sigma(\lambda))|\right)\,.
\end{equation}
In this equation we know everything except $\max_{\lambda \in [\lmin, L]}|T_t(\sigma(\lambda))|$
and $\max_{\lambda \in [\lmin, L]}|U_t(\sigma(\lambda))|$.
</p>
<p>
Luckily, we have all sorts of bounds on Chebyshev polynomials. One of the simplest and most
useful ones bounds these polynomials on the $[-1, 1]$ interval.
</p>
<figure>
<span class="marginnote">Chebyshev polynomials behave differently inside and outside the $[-1, 1]$ interval. <br><br>This interval (shaded blue region in left plot) contains all the zeros of both polynomials, and they're bounded either by a constant or by a term that scales linearly with the iteration (Eq. \eqref{eq:bound_chebyshev}). <br><br>Outside of this interval, both polynomials grow at a rate that is exponential in the number of iterations. <br><br>
<a href="https://colab.research.google.com/gist/fabianp/f33a3a8fd933444c02888b531aee85b8/polynomials_acceleration.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</span>
<img src="/images/2020/chebyshev_interval.png" alt="">
</figure>
<p>
Inside the interval $\xi \in [-1, 1]$, we have the
bounds
\begin{equation}\label{eq:bound_chebyshev}
|T_t(\xi)| \leq 1 \quad \text{ and } \quad |U_{t}(\xi)| \leq t+1\,.
\end{equation}
In our case, the Chebyshev polynomials are evaluated at $\sigma(\lambda)$, so these bounds are
valid whenever $\max_{\lambda \in [\lmin, L]}|\sigma(\lambda)| \leq 1$. Since $\sigma$ is a linear function of $\lambda$, its extremal points are reached at the edges of the interval, and so the previous condition is equivalent to $|\sigma(\lmin)| \leq 1 $ and $|\sigma(L)| \leq 1$, which, after rearranging, gives the conditions
\begin{equation}\label{eq:condition_stable_region}
\frac{(1 - \sqrt{{\color{colormomentum}m}})^2 }{{\color{colorstepsize}h}} \leq \lmin \quad \text{ and }\quad L
\leq~\frac{(1 + \sqrt{{\color{colormomentum}m}})^2 }{{\color{colorstepsize}h}}\,.
\end{equation}
We will refer to the set of parameters that
verify the above as the <i>robust region</i>. This is a subset of the set of admissible parameters –that is, those for which the method converges–. The set of admissible parameters is<dt-cite key="polyak1964some"></dt-cite>
\begin{equation}
0 \leq {\color{colormomentum}m} \lt 1 \quad \text{ and }\quad {\color{colorstepsize}h} \lt \frac{2 (1 + {\color{colormomentum}m})}{L}\,.
\end{equation}
</p>
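<p>
In code, checking membership in the robust and admissible regions is a direct transcription of the inequalities above (a sketch; the function names and the eigenvalue bounds are illustrative):
</p>

```python
import numpy as np

def in_robust_region(m, h, lmin, lmax):
    """Robust region: both inequalities on sigma hold."""
    return (1 - np.sqrt(m)) ** 2 / h <= lmin and lmax <= (1 + np.sqrt(m)) ** 2 / h

def is_admissible(m, h, lmax):
    """Parameters for which momentum converges at all (Polyak, 1964)."""
    return 0 <= m < 1 and 0 < h < 2 * (1 + m) / lmax

lmin, lmax = 0.1, 1.0
inside = in_robust_region(0.5, 1.0, lmin, lmax)   # robust (hence admissible)
outside = in_robust_region(0.0, 1.0, lmin, lmax)  # no momentum: not robust here
```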
<figure class="">
<span class="marginnote">
The robust region is the set of momentum (${\color{colormomentum}m}$) and step-size parameters
(${\color{colorstepsize}h}$) that satisfy inequality \eqref{eq:condition_stable_region}. It's a subset of the set of parameters for which momentum converges (admissible region).
<br><br>
Example of robust region for $\lmin=0.1, L=1$.
<br><br>
<a href="https://colab.research.google.com/gist/fabianp/f33a3a8fd933444c02888b531aee85b8/polynomials_acceleration.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</span>
<img src="/images/2020/robust_region.png" alt="">
</figure>
<p>
In the robust region, we can use the bounds above and obtain the rate
\begin{equation}\label{eq:convergence_rate_stable}
r_t \leq {\color{colormomentum}m}^{t/2} \left( 1
+ {\small\frac{1 - {\color{colormomentum}m}}{1 + {\color{colormomentum}m}}}\, t\right)
\end{equation}
This convergence rate is somewhat surprising, as one might expect
that the convergence depends on both the <span style="color: #d95f02">step-size</span> and <span style="color: #1b9e77">momentum</span>. Instead, in the robust region the convergence rate
does not depend on the step-size. This
insensitivity to step-size can be leveraged for example to develop a momentum tuner.<dt-cite key="zhang2017yellowfin"></dt-cite>
</p>
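<p>
We can observe this insensitivity numerically: running momentum with a fixed momentum parameter and two different step-sizes inside the robust region, both runs satisfy the same bound \eqref{eq:convergence_rate_stable}. A sketch with illustrative values:
</p>

```python
import numpy as np

# Fixed momentum m = 0.5; both step-sizes below lie in the robust region
# for eigenvalues in [0.1, 1], so the same rate bound applies to both runs.
lmin, lmax, m, T = 0.1, 1.0, 0.5, 60
H = np.diag(np.linspace(lmin, lmax, 5))
x0 = np.ones(5)  # minimizer is the origin since b = 0

def final_error(h):
    x_prev, x = x0, x0 - h / (1 + m) * (H @ x0)
    for _ in range(T - 1):
        x, x_prev = x + m * (x - x_prev) - h * (H @ x), x
    return np.linalg.norm(x)

# rate bound m^{T/2} (1 + T (1 - m) / (1 + m)), independent of h
bound = m ** (T / 2) * (1 + (1 - m) / (1 + m) * T) * np.linalg.norm(x0)
errs = [final_error(h) for h in (1.0, 1.5)]
```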
<p>
Given that we would like the convergence rate to be as small as possible, it's natural to choose
the momentum term as small as possible while staying in the robust region. But just how small
can we make this momentum term?
</p>
<h2>Polyak strikes back</h2>
<p>
This post ends as it starts: with Polyak momentum. We will see that minimizing the convergence
rate while staying in the robust region will lead us naturally to this method.
</p>
<p>Consider the inequalities that describe the robust region:
\begin{equation}\label{eq:robust_region_2}
\frac{(1 - \sqrt{{\color{colormomentum}m}})^2 }{{\color{colorstepsize}h}} \leq \lmin \quad \text{ and }\quad L
\leq~\frac{(1 + \sqrt{{\color{colormomentum}m}})^2 }{{\color{colorstepsize}h}}\,.
\end{equation}
Solving for ${\color{colorstepsize}h}$ on both equations allows us to describe the admissible values of ${\color{colormomentum}m}$ as a single inequality:
\begin{equation}
\left(\frac{1 - \sqrt{{\color{colormomentum}m}} }{1 + \sqrt{{\color{colormomentum}m}}}\right)^2 \leq \frac{\lmin}{L}\,.
\end{equation}
The left hand side is monotonically decreasing in ${\color{colormomentum}m}$. Since our objective is to <i>minimize</i> ${\color{colormomentum}m}$, this is achieved when the inequality becomes an equality, which gives
\begin{equation}
{\color{colormomentum}m = {\Big(\frac{\sqrt{L}-\sqrt{\lmin}}{\sqrt{L}+\sqrt{\lmin}}\Big)^2}}
\end{equation}
Finally, the step-size parameter ${\color{colorstepsize}h}$ can be computed from plugging the above momentum parameter into either one of the inequalities \eqref{eq:robust_region_2}:
\begin{equation}
{\color{colorstepsize}h = \Big(\frac{ 2}{\sqrt{L}+\sqrt{\lmin}}\Big)^2}\,.
\end{equation}
These are the step-size and momentum parameter of Polyak momentum of \eqref{eq:polyak_parameters}.
</p>
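<p>
A quick numerical check that the Polyak parameters satisfy both robust-region inequalities \eqref{eq:robust_region_2} with equality, i.e. that they sit exactly on the boundary of the robust region (the values of $\lmin$ and $L$ are illustrative):
</p>

```python
import numpy as np

lmin, lmax = 0.05, 1.0
m = ((np.sqrt(lmax) - np.sqrt(lmin)) / (np.sqrt(lmax) + np.sqrt(lmin))) ** 2
h = (2 / (np.sqrt(lmax) + np.sqrt(lmin))) ** 2
gap_low = (1 - np.sqrt(m)) ** 2 / h - lmin   # should vanish
gap_high = (1 + np.sqrt(m)) ** 2 / h - lmax  # should vanish
```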
<p>
This provides an alternative derivation of Polyak momentum from Chebyshev polynomials.<dt-note>Another way to derive Polyak momentum
consists in taking the limit $t\to \infty$ of the Chebyshev iterative method derived in
the previous post. See for instance paragraph "Simpler stationary recursion" in <a href="https://francisbach.com/chebyshev-polynomials/">Francis' blog post</a>.</dt-note> One advantage of this derivation is that we obtain the
convergence rate for free. Since this combination of step-size and momentum belongs to the
robust region, we can apply \eqref{eq:convergence_rate_stable}, which gives
\begin{align}
r_t^{\text{Polyak}} &\leq \left({\small\frac{\sqrt{L} - \sqrt{\lmin}}{\sqrt{L} +
\sqrt{\lmin}}}\right)^t\left({\small 1 + t\frac{2 \sqrt{L \lmin}}{L + \lmin}}\right)
\label{eq:nonasymptotic_polyak}\,.
\end{align}
This implies that the convergence rate of Polyak momentum –like that of the Chebyshev method– is asymptotically bounded as
\begin{equation}
\limsup_{t \to \infty} \sqrt[t]{r_t^{\text{Polyak}}} \leq {\small\frac{\sqrt{L} - \sqrt{\lmin}}{\sqrt{L} +
\sqrt{\lmin}}}\,,
\end{equation}
which is a whopping square root
factor improvement over gradient descent's $\frac{\lmax - \lmin}{\lmax +
\lmin}$ convergence rate.
</p>
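<p>
To get a feel for the magnitude of this improvement, the sketch below compares the two asymptotic rates for an illustrative condition number of $100$:
</p>

```python
import numpy as np

lmin, lmax = 0.01, 1.0  # condition number L / lmin = 100
rate_polyak = (np.sqrt(lmax) - np.sqrt(lmin)) / (np.sqrt(lmax) + np.sqrt(lmin))
rate_gd = (lmax - lmin) / (lmax + lmin)

def iters_to(rate, target=1e-6):
    """Iterations for rate**t to fall below target."""
    return int(np.ceil(np.log(target) / np.log(rate)))
```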
<h3>Stay Tuned</h3>
<p>
In this post we've only scratched the surface of the rich dynamics of momentum.
In the next post we will dig deeper into the robust region and explore the other regions too.
</p>
<h2>Citing</h2>
<p>
If you find this blog post useful, please consider citing it as:
</p>
<p style="margin-left: 5%">
<a href="http://fa.bianp.net/blog/2020/momentum/">Momentum: when Chebyshev meets Chebyshev.</a>, Fabian Pedregosa, 2020
</p>
<p>
Bibtex entry:
</p>
<pre>
<code>
@misc{pedregosa2021residual,
title={Momentum: when Chebyshev meets Chebyshev},
author={Pedregosa, Fabian},
howpublished = {\url{http://fa.bianp.net/blog/2020/momentum/}},
year={2020}
}
</code>
</pre>
<p>
<b>Acknowledgements.</b> Thanks to <a href="https://scholar.google.com/citations?user=93PAG2AAAAAJ&hl=en">Baptiste Goujaud</a> for reporting many typos and making excellent clarifying suggestions, and to
<a href="http://nicolas.le-roux.name/">Nicolas Le Roux</a> for catching some mistakes in an initial version of this post and providing thoughtful feedback.
</p>
<hr />
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
On the Link Between Polynomials and Optimization, Part 12020-04-07T00:00:00+02:002020-04-07T00:00:00+02:00Fabian Pedregosatag:fa.bianp.net,2020-04-07:/blog/2020/polyopt/
<p>
There's a fascinating link between minimization of quadratic functions and polynomials. A link
that goes
deep and allows to phrase optimization problems in the language of polynomials and vice versa.
Using this connection, we can tap into centuries of research in the theory of polynomials and
shed new light on …</p>
<p>
There's a fascinating link between minimization of quadratic functions and polynomials. A link
that goes
deep and allows to phrase optimization problems in the language of polynomials and vice versa.
Using this connection, we can tap into centuries of research in the theory of polynomials and
shed new light on old problems.
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js", "color.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@article{fletcher2005barzilai,
title={On the Barzilai-Borwein method},
author={Fletcher, Roger},
journal={Optimization and control with applications},
year={2005},
publisher={Springer}
}
@article{dai2005asymptotic,
title={On the asymptotic behaviour of some new gradient methods},
author={Dai, Yu-Hong and Fletcher, Roger},
journal={Mathematical Programming},
year={2005},
publisher={Springer},
url={https://link.springer.com/content/pdf/10.1007%2Fs10107-004-0516-9.pdf},
}
@article{Rutishauser1959,
author="Rutishauser, H.",
title="Theory of Gradient Methods",
journal="Refined Iterative Methods for Computation of the Solution and the Eigenvalues of Self-Adjoint Boundary Value Problems",
year="1959",
url="https://doi.org/10.1007/978-3-0348-7224-9_2"
}
@article{hardt2013zen,
author="Moritz Hardt",
title="The zen of gradient descent",
journal="Moody Rd (blog)",
year="2013",
url="http://blog.mrtz.org/2013/09/07/the-zen-of-gradient-descent.html"
}
@book{golub2012matrix,
title={Matrix computations},
author={Golub, Gene H and Van Loan, Charles F},
year={2013},
journal={Johns Hopkins University Press}
}
@article{yuan2008step,
title={Step-sizes for the gradient method},
author={Yuan, Ya-xiang},
journal={AMS IP Studies in Advanced Mathematics},
volume={42},
number={2},
pages={785},
year={2008},
publisher={Providence, RI; American Mathematical Society; 1999},
url={ftp://lsec.cc.ac.cn/pub/yyx/papers/p0504.pdf}
}
@article{flanders1950numerical,
title={Numerical determination of fundamental modes},
author={Flanders, Donald A and Shortley, George},
journal={Journal of Applied Physics},
year={1950},
url={https://doi.org/10.1063/1.1699598},
}
@article{nemirovski1995information,
title={Information-based complexity of convex programming},
author={Nemirovski, Arkadi},
journal={Lecture Notes (Lecture 12)},
year={1995},
url={https://www2.isye.gatech.edu/~nemirovs/Lec_EMCO.pdf}
}
@article{bauer2011my,
title={My years with Rutishauser},
author={Bauer, Friedrich L},
journal={Informatik-Spektrum},
year={2011},
publisher={Springer},
url={https://sci-hub.tw/10.1007/s00287-011-0554-7}
}
@article{gutknecht1910numerical,
title={Numerical analysis in Zurich--50 years ago},
author={Gutknecht, Martin H},
journal={Schweizerische Mathematische Gesellschaft},
year={2010},
url={http://www.sam.math.ethz.ch/~mhg/pub/mhg-published/notes8-Gut10-NAZ-SMG100.pdf}
}
@article{cauchy1847methode,
title={Méthode générale pour la résolution des systèmes d'équations simultanées},
author={Cauchy, Augustin},
journal={Comp. Rend. Sci. Paris},
volume={25},
number={1847},
pages={536--538},
year={1847},
url={https://gallica.bnf.fr/ark:/12148/bpt6k90190w/f406}
}
@article{lemarechal2012cauchy,
title={Cauchy and the gradient method},
author={Lemaréchal, Claude},
journal={Doc Math Extra},
year={2012},
url={https://www.math.uni-bielefeld.de/documenta/vol-ismp/40_lemarechal-claude.pdf}
}
@incollection{scieur2016regularized,
title = {Regularized Nonlinear Acceleration},
author = {Scieur, Damien and d'Aspremont, Alexandre and Bach, Francis},
journal = {Advances in Neural Information Processing Systems 29},
year = {2016},
url = {https://arxiv.org/pdf/1606.04133.pdf}
}
@book{fischer1996polynomial,
title={Polynomial based iteration methods for symmetric linear systems},
author={Fischer, Bernd},
year={1996},
url={https://doi.org/10.1007/978-3-663-11108-5},
journal={Springer}
}
@book{saad2003iterative,
title={Iterative methods for sparse linear systems},
author={Saad, Yousef},
volume={82},
year={2003},
journal={SIAM},
url={https://www-users.cs.umn.edu/~saad/IterMethBook_2ndEd.pdf}
}
@article{lanczos1950iteration,
title={An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators1},
author={Lanczos, Cornelius},
journal={Journal of Research of the National Bureau of Standards},
year={1950},
url={http://www.cs.umd.edu/~oleary/lanczos1950.pdf}
}
@article{lanczos1952solution,
title={Solution of systems of linear equations by minimized iterations},
author={Lanczos, Cornelius},
journal={J. Res. Nat. Bur. Standards},
url={https://nvlpubs.nist.gov/nistpubs/jres/049/1/V49.N01.A06.pdf},
year={1952}
}
@article{hestenes1952methods,
title={Methods of conjugate gradients for solving linear systems},
author={Hestenes, Magnus R and Stiefel, Eduard},
journal={Journal of research of the National Bureau of Standards},
year={1952},
url={https://pdfs.semanticscholar.org/466d/addfb6340c28cb8da548007028c8cc5df687.pdf}
}
@article{pedregosa2020average,
title={Average-case Acceleration Through Spectral Density Estimation},
author={Pedregosa, Fabian and Scieur, Damien},
journal={Proceedings of the 37th International Conference on Machine Learning (ICML)},
year={2020},
url={https://arxiv.org/pdf/2002.04756.pdf}
}
@article{scieur2020universal,
title={Universal Average-Case Optimality of Polyak Momentum},
author={Scieur, Damien and Pedregosa, Fabian},
journal={Proceedings of the 37th International Conference on Machine Learning (ICML)},
year={2020},
url={https://arxiv.org/pdf/2002.04664.pdf}
}
@article{polyak1964some,
title={Some methods of speeding up the convergence of iteration methods},
author={Polyak, Boris T},
journal={USSR Computational Mathematics and Mathematical Physics},
year={1964},
url={https://doi.org/10.1016/0041-5553(64)90137-5}
}
@article{polyak1987introduction,
title={Introduction to Optimization},
author={Polyak, Boris T},
journal={Optimization Software, Inc. Publications Division, New York},
url={https://b-ok.cc/book/2461679/c8b7e4},
year={1987}
}
@article{bach2019polynomial,
title={Polynomial magic I: Chebyshev polynomials},
author={Bach, Francis},
url={https://francisbach.com/chebyshev-polynomials/},
journal={Blog post},
year={2019}
}
@article{markoff1916polynome,
title={Über Polynome, die in einem gegebenen Intervalle möglichst wenig von Null abweichen},
author={Markov, Vladimir},
journal={Mathematische Annalen},
year={1916},
openaccess={https://sci-hub.se/10.1007/BF01456902},
url={https://doi.org/10.1007/BF01456902}
}
@article{berthier2020accelerated,
title={Accelerated Gossip in Networks of Given Dimension using Jacobi Polynomial Iterations},
author={Berthier, Raphaël and Bach, Francis and Gaillard, Pierre},
journal={SIAM Journal on Mathematics of Data Science},
year={2020},
url={https://doi.org/10.1137/19M1244822},
publisher={SIAM}
}
@article{arjevani2016lower,
title={On lower and upper bounds in smooth and strongly convex optimization},
author={Arjevani, Yossi and Shalev-Shwartz, Shai and Shamir, Ohad},
journal={The Journal of Machine Learning Research},
url={http://www.jmlr.org/papers/volume17/15-106/15-106.pdf},
year={2016},
}
@article{lacotte2020optimal,
title={Optimal Randomized First-Order Methods for Least-Squares Problems},
author={Lacotte, Jonathan and Pilanci, Mert},
journal={Proceedings of the 37th International Conference on Machine Learning (ICML)},
year={2020},
url={https://arxiv.org/pdf/2002.09488.pdf}
}
@article{chebyshev1853theorie,
title={Théorie des mécanismes connus sous le nom de parallélogrammes},
author={Chebyshev, Pafnuti},
year={1853},
journal={Imprimerie de l'Académie impériale des sciences},
url={https://books.google.ca/books?id=IOrnAAAAMAAJ&lpg=PA537&ots=qK0s3q3LTN&dq=Th%C3%A9orie%20des%20m%C3%A9canismes%20connus%20sous%20le%20nom%20de%20parall%C3%A9logrammes&pg=PA538#v=onepage&q&f=false}
}
@article{frankel1950convergence,
title={Convergence rates of iterative treatments of partial differential equations},
author={Frankel, Stanley P},
journal={Mathematical Tables and Other Aids to Computation},
volume={4},
number={30},
pages={65--75},
year={1950},
publisher={JSTOR}
}
@article{hochstrasser1954anwendung,
title={Die Anwendung der Methode der konjugierten Gradienten und ihrer Modifikationen auf die Lösung linearer Randwertprobleme},
author={Hochstrasser, Urs},
year={1954},
journal={Doctoral Thesis (in German), ETH Zurich},
url={https://doi.org/10.3929/ethz-a-000091966}
}
@article{shortley1953use,
title={Use of Tschebyscheff-Polynomial Operators in the Numerical Solution of Boundary-Value Problems},
author={Shortley, George},
journal={Journal of Applied Physics},
year={1953},
publisher={American Institute of Physics},
url={https://doi.org/10.1063/1.1721292}
}
@article{young1953richardson,
title={On Richardson's method for solving linear systems with positive definite matrices},
author={Young, David},
journal={Journal of Mathematics and Physics},
year={1953},
publisher={Wiley Online Library},
url={https://doi.org/10.1002/sapm1953321243},
}
@article{varga1957comparison,
title={A comparison of the successive overrelaxation method and semi-iterative methods using Chebyshev polynomials},
author={Varga, Richard S},
journal={Journal of the Society for Industrial and Applied Mathematics},
year={1957},
url={https://doi.org/10.1137/0105004},
publisher={SIAM}
}
@article{golub1961chebyshev,
title={Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods},
author={Golub, Gene H and Varga, Richard S},
year={1961},
journal={Numerische Mathematik},
url={https://doi.org/10.1007/BF01386013}
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
document.addEventListener('DOMContentLoaded', doTOC, false);
</script>
<div id="TOC"></div>
<div style="display: none">
$$
\def\rr{\boldsymbol r}
\def\HH{\boldsymbol H}
\def\EE{\mathbb E}
\def\II{\boldsymbol I}
\def\CC{\boldsymbol C}
\def\DD{\boldsymbol D}
\def\KK{\boldsymbol K}
\def\eeps{\boldsymbol \varepsilon}
\def\tr{\text{tr}}
\def\LLambda{\boldsymbol \Lambda}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\qq{\boldsymbol q}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\lmax{L}
\def\lmin{\mu}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\QQ{\boldsymbol Q}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\DeclareMathOperator{\span}{\mathbf{span}}
\def\defas{\stackrel{\text{def}}{=}}
\def\dif{\mathop{}\!\mathrm{d}}
$$
</div>
<h2>Introduction</h2>
<p>
This post deals with a connection between optimization algorithms and polynomials. The problem that we will be looking at throughout the post is that of finding a vector $\xx^\star \in \RR^d$ that minimizes the convex
quadratic objective
\begin{equation}\label{eq:opt}
f(\xx) \defas \frac{1}{2}\xx^\top \HH \xx + \bb^\top \xx~,
\end{equation}
where $\HH$ is a symmetric positive definite matrix.<dt-note>The quadratic assumption is made for
the purpose of the analysis. All the discussed algorithms are applicable to any smooth (not
necessarily quadratic) function.
</dt-note>
</p>
<p>
A popular class of algorithms for this problem are <i>gradient-based methods</i>. These are
methods in which the next iterate is a linear combination of the previous iterate and past
gradients. In the next section we will see how we can identify any gradient-based method with a
polynomial that determines the method's convergence speed. And vice-versa, we will see in
section 3 that it's possible to do the inverse path and go from a (subset of) polynomials to
optimization methods. Then we will use this connection to derive the Chebyshev iterative method,
which forms the basis of some of the most used optimization methods such as Polyak momentum,
which will be the topic of the next blog post.
</p>
<h3>Pioneers of Scientific Computing</h3>
<p>
The ideas outlined in this post can be traced back to the early years of numerical analysis.
Minimizing quadratic objective functions (or equivalently solving linear systems of equations)
was one of the first applications of scientific computing,
and while some methods like gradient descent (Cauchy, 1847)<dt-cite key="cauchy1847methode">
</dt-cite>
<dt-cite key="lemarechal2012cauchy"></dt-cite> and <a href="https://en.wikipedia.org/wiki/Gauss%E2%80%93Seidel_method#cite_note-1">Gauss-Seidel</a>
predate the development of electronic computers, their algorithmic analysis as we know it today
emerged during the 1940s and 1950s.
</p>
<p>
One of the first applications of the theory of polynomials to the solution of linear systems is
the development of the Chebyshev iterative method, done independently by Flanders and Shortley,<dt-cite key="flanders1950numerical"></dt-cite> Cornelius Lanczos,<dt-cite key="lanczos1952solution"></dt-cite> and David Young,<dt-cite key="young1953richardson">
</dt-cite> and further analyzed and popularized by Richard Varga<dt-cite key="varga1957comparison"></dt-cite> and Gene Golub.<dt-cite key="golub1961chebyshev">
</dt-cite>
Shortly after the development of the Chebyshev iterative method, the field saw one of the most
beautiful applications of the theory of orthogonal polynomials in numerical analysis, the
conjugate gradient method.<dt-cite key="hestenes1952methods"></dt-cite>
</p>
<p>
An excellent review on the topic is
the 1959 book chapter “Theory of Gradient Methods” by H. Rutishauser,<dt-cite key="Rutishauser1959"></dt-cite>
<dt-note><img width="50%" src="/images/2019/1960_rutishauser.jpg" alt="" style="display: block; margin: 0 auto; max-width: 150px; box-shadow: 6px 6px 3px grey;"><br /> <a href="https://en.wikipedia.org/wiki/Heinz_Rutishauser">Heinz Rutishauser</a> (1918–1970).
Among other contributions, he introduced a number of basic syntactic features to programming,
notably the keyword "for" for a for loop, first as the German "für" in Superplan, next via its
English translation "for" in ALGOL 58A. </dt-note> which has stood the test of time remarkably
well. A more modern but equally excellent review is the 1996 book by Bernd Fischer.<dt-cite key="fischer1996polynomial"></dt-cite>
</p>
<p>
These ideas are still relevant today. Just in the last year, these tools have been used for
example to develop accelerated decentralized algorithms,<dt-cite key="berthier2020accelerated">
</dt-cite>
<dt-cite key="bach2019polynomial"></dt-cite> to derive lower bounds<dt-cite key="arjevani2016lower"></dt-cite> and to develop methods that are optimal for the
average-case.<dt-cite key="pedregosa2020average"></dt-cite>
<dt-cite key="lacotte2020optimal"></dt-cite>
</p>
<h2>From Optimization to Polynomials</h2>
<p>
In this section we will develop a method to assign to every optimization
method a polynomial that determines its convergence. We will first motivate this approach
through gradient descent and then generalize it to other methods.
</p>
<h3>Motivation: Gradient Descent</h3>
<p>
Consider the gradient descent method that generates iterates following
\begin{equation}
\xx_{t+1} = \xx_t - \tfrac{2}{L + \lmin} \nabla f(\xx_t)~,
\end{equation}
where $L$ and $\lmin$ are the largest and smallest eigenvalue of $\HH$ respectively.
Our goal is to derive a bound on the error $\|\xx_{t+1} - \xx^\star\|$ as a function of the
number of iterations $t$ and the spectral properties of $\HH$.
</p>
<p>
By the first order optimality conditions, at the optimum we have $\nabla f(\xx^\star) = 0$ and
so $\HH \xx^\star = - \bb$ by the quadratic assumption. We can use this to write the gradient as $\nabla f(\xx_t) = \HH
(\xx_t - \xx^\star)$.
Subtracting $\xx^\star$ from both sides of the above equation we have
\begin{align}
&\xx_{t+1} - \xx^\star = \xx_t - \tfrac{2}{L + \lmin} \HH (\xx_t - \xx^\star) - \xx^\star\\
&\quad= (\II - \tfrac{2}{L + \lmin} \HH)(\xx_t - \xx^\star)\\
&\quad= (\II - \tfrac{2}{L + \lmin} \HH)^2(\xx_{t-1} - \xx^\star) = \cdots \\
&\quad= (\II - \tfrac{2}{L + \lmin} \HH)^{t+1} (\xx_0 - \xx^\star)
\end{align}
We now have an expression for $\xx_{t+1} - \xx^\star$ in terms of the initial conditions $\xx_0
- \xx^\star$, and a polynomial of degree $t$ in $\HH$.
Taking the 2-norm on both sides and using the Cauchy-Schwarz inequality we have
\begin{align}
\|\xx_{t+1} - \xx^\star\|_2 &\leq \|(\II - \tfrac{2}{L + \lmin}\HH)^{t+1}\|_2 \|\xx_0 - \xx^\star\|_2\label{eq:gd_cauchy_schwartz}\\
&\leq \max_{\lambda \in [\lmin, L]}\, |(1 - \tfrac{2}{L + \lmin} \lambda)^{t+1}| \|\xx_0 -
\xx^\star\|_2
\label{eq:gd_respol_bound}
\end{align}
where the second inequality follows from the definition of the matrix 2-norm of a symmetric matrix. In this bound, convergence
is determined by the maximum absolute value of the polynomial
\begin{equation}
P^{\text{GD}}_{t+1}(\lambda) \defas \left(1 - \tfrac{2}{L + \lmin} \lambda\right)^{t+1}~.
\end{equation}
Here's a plot of this polynomial, together with its maximum absolute value represented
with a dashed line:
</p>
<figure>
<span class="marginnote"><br><br><br>The residual polynomial for gradient descent,
$P_t(\lambda)
= (1 - 2 \lambda / (\lmax + \lmin))^t$.
Only even degrees are displayed for visualization purposes. <br />
<a href="https://colab.research.google.com/gist/fabianp/98b052553d5fc50c7c2d099360bb2df5/polynomials_acceleration.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></span>
<img src="/images/2020/gd_residual_polynomial.png">
</figure>
<p>
As expected, as $t$ increases, the polynomial goes to zero in the interval $[\lmin, L]$.
Furthermore, the largest absolute value of the polynomial is achieved at the edges, and we can use this
to bound its maximum value in \eqref{eq:gd_respol_bound}.
This leads to the following classical bound on the error:
\begin{equation}
\|\xx_t - \xx^\star\| \leq \left(\frac{\lmax - \lmin}{\lmax + \lmin} \right)^{t}\|\xx_0 -
\xx^\star\|~.
\end{equation}
</p>
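<p>To make the bound concrete, here is a small numerical sketch (my own illustration, not part of the post; the matrix, dimension and seed are arbitrary) that runs gradient descent with step size $2/(L+\lmin)$ on a random quadratic and checks the classical rate above at every iteration:</p>

```python
# Sketch: gradient descent with step 2/(L + mu) on a random quadratic,
# checked against the rate ((L - mu)/(L + mu))**t. The problem data is
# made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
d = 20
A = rng.standard_normal((d, d))
H = A @ A.T + 0.1 * np.eye(d)      # symmetric positive definite Hessian
b = rng.standard_normal(d)

eigs = np.linalg.eigvalsh(H)
mu, L = eigs[0], eigs[-1]
x_star = np.linalg.solve(H, -b)    # minimizer of 0.5 x'Hx + b'x

x = np.zeros(d)
r0 = np.linalg.norm(x - x_star)
for t in range(1, 101):
    x = x - (2.0 / (L + mu)) * (H @ x + b)   # gradient step
    rate = ((L - mu) / (L + mu)) ** t
    assert np.linalg.norm(x - x_star) <= rate * r0 + 1e-10
```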
<h3 id="gen-gradient">Generalization to Gradient-based Methods</h3>
<p>
We will now show how the previous approach generalizes to any gradient-based method.
These are methods where the update is a linear
combination of the current iterate, current gradient and difference of previous iterates
\begin{equation}\label{eq:gradient_based}
\xx_{t+1} = \xx_{t} + \sum_{i=0}^{t-1} c^{(t)}_{i} (\xx_{i+1}\!-\!\xx_{i}) + c^{(t)}_{t} \nabla
f(\xx_{t})~,
\end{equation}
for some scalar values $c^{(t)}_{j}$.<dt-note>This is a large class of methods that includes gradient descent and (Nesterov) momentum, but not methods with matrix preconditioners like BFGS, since the coefficients are required to be scalars.</dt-note> To this method, we will associate the following
<i>residual
polynomial</i> $P_t$, which is a polynomial of degree $t$ defined recursively as
\begin{equation}
\begin{split}
P_{t+1}(\lambda) &= (1 + c^{(t)}_{t} \lambda)P_{t}(\lambda) + \sum_{i=0}^{t-1} c^{(t)}_{i}(P_{i+1}(\lambda)\!-\!P_{i}(\lambda))\\
P_0(\lambda) &= 1~.
\end{split}\label{eq:def_residual_polynomial}
\end{equation}
<!-- Note: label needs to be at end of equation because of https://github.com/mathjax/MathJax/issues/1020 -->
</p>
<p class="remark">
By construction, all residual polynomials verify $P_t(0) = 1$.
</p>
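<p>The recursion \eqref{eq:def_residual_polynomial} can be run mechanically. The short sketch below (my own illustration; the coefficients $c^{(t)}_i$ are random placeholders standing in for a method's actual coefficients) builds a few residual polynomials with numpy's <code>Polynomial</code> class and checks the invariant $P_t(0) = 1$:</p>

```python
# Sketch of the residual-polynomial recursion for arbitrary (random)
# coefficients c^{(t)}_i; the only point is to verify P_t(0) = 1.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
lam = Polynomial([0.0, 1.0])      # the variable "lambda"

P = [Polynomial([1.0])]           # P_0 = 1
for t in range(5):
    c = rng.standard_normal(t + 1)        # c^{(t)}_0, ..., c^{(t)}_t
    P_next = (1 + c[t] * lam) * P[t]
    for i in range(t):
        P_next = P_next + c[i] * (P[i + 1] - P[i])
    P.append(P_next)

for t, Pt in enumerate(P):
    assert np.isclose(Pt(0.0), 1.0)   # every residual polynomial has P_t(0) = 1
    assert Pt.degree() <= t           # P_t has degree at most t
```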
<p>
The following lemma shows how the residual polynomial can be useful in the analysis of
optimization methods. In particular, it shows how to express the error ($\xx_t - \xx^\star$) in
terms of the residual polynomial:<dt-note>It's rare in optimization to have an <i>exact</i> expression for the complexity of an algorithm, and not just a potentially loose upper bound. This is one of those cases.</dt-note>
</p>
<p class="lemma" text="Residual polynomial" id="lemma_residual_polynomial"> For any gradient-based
method and any iteration $t$, we have
\begin{equation}
\xx_t - \xx^\star = P_t(\HH)(\xx_0 - \xx^\star) \label{eq:residual_error}~.
\end{equation}
</p>
<div class="proof">
<p>
We will prove this by induction. For $t=0$, we have $P_0(\HH) = \boldsymbol{I}$ since $P_0$ is
a polynomial of
degree zero and so $\xx_0 - \xx^\star = \II(\xx_0 - \xx^\star)$ is clearly true.
</p>
<p>
Suppose now it's true for $t$. We will show that this implies it's true for $t+1$.
By definition of $\xx_{t+1}$ in Eq. \eqref{eq:gradient_based} we have:
\begin{align*}
&\xx_{t+1} - \xx^\star = \xx_t - \xx^\star + \sum_{i=0}^{t-1} c^{(t)}_{i} (\xx_{i+1} - \xx_{i})+ c_t^{(t)}\nabla
f(\xx_{t}) \\
&= (\II + c^{(t)}_{t} \HH)(\xx_t - \xx^\star) + \sum_{i=0}^{t-1} c^{(t)}_{i} (\xx_{i+1} - \xx_{i})\\
&= (\II + c^{(t)}_{t} \HH)P_t(\HH)(\xx_0 - \xx^\star) + \sum_{i=0}^{t-1} c^{(t)}_{i} (P_{i+1}(\HH)
-P_{i}(\HH))(\xx_0 - \xx^\star)\\
& = P_{t+1}(\HH)(\xx_0 - \xx^\star)\label{eq:follows_def_pt}
\end{align*}
where the second identity follows by the induction hypothesis and the last one by definition
of the residual polynomial in \eqref{eq:def_residual_polynomial}.
</p>
</div>
<!-- <p class="remark">
The above proof relies crucially in the identity $\nabla f(\xx_t) = \HH (\xx_t - \xx^\star)$, which in turn relies on the quadratic assumption. The rest of the results use the quadratic assumption indirectly through this lemma and the next corollary.
</p> -->
<p>
Given its simplicity, it's easy to glance over this lemma, so let's take a minute to appreciate its significance. It says that <i>for any gradient-based method</i> I can express the error $\xx_t - \xx^\star$ <i>exactly</i> as a product of two terms. One term is the error at initialization $\xx_0 - \xx^\star$ and doesn't depend on the optimization method. The other term, $P_t(\HH)$, doesn't depend on the initialization but does depend on the optimization method. Since we only have control over the optimization method, this lemma tells us that the efficiency of any optimization method can be studied exclusively in terms of its associated residual polynomial $P_t$.
</p>
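<p>For a concrete check of the lemma, the sketch below (an illustration on assumed random data, not from the post) runs gradient descent, whose residual polynomial is $P_t(\lambda) = (1 - \text{step}\cdot\lambda)^t$, and verifies that $\xx_t - \xx^\star = P_t(\HH)(\xx_0 - \xx^\star)$ holds to machine precision:</p>

```python
# Numerical check of the residual polynomial lemma for gradient descent:
# x_t - x* equals P_t(H)(x0 - x*) with P_t(lambda) = (1 - step*lambda)**t.
import numpy as np

rng = np.random.default_rng(1)
d = 10
A = rng.standard_normal((d, d))
H = A @ A.T + np.eye(d)
b = rng.standard_normal(d)
x_star = np.linalg.solve(H, -b)

step = 0.9 / np.linalg.eigvalsh(H)[-1]
x0 = rng.standard_normal(d)
x = x0.copy()
for t in range(1, 6):
    x = x - step * (H @ x + b)
    # P_t(H) = (I - step * H)^t applied to the initial error
    Pt_H = np.linalg.matrix_power(np.eye(d) - step * H, t)
    assert np.allclose(x - x_star, Pt_H @ (x0 - x_star))
```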
<p>
From this lemma we can also derive a bound that replaces the matrix
polynomial by a scalar bound, which is often more interpretable.
For this, consider the <a href="https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix#Eigendecomposition_of_a_matrix">eigendecomposition</a>
of $\HH = \QQ \LLambda \QQ^T$, where $\QQ$ is orthonormal and $\LLambda$ is a diagonal matrix
with diagonal elements the eigenvalues of $\HH$. Then taking norms in \eqref{eq:residual_error}
and using this decomposition we have
\begin{align}
\|\xx_t - \xx^\star\| &= \|P_t(\HH) (\xx_0 - \xx^\star)\|\\
&\leq \|P_t(\HH)\|\|\xx_0 - \xx^\star\|\quad \text{ (Cauchy-Schwarz)}\\
&= \|P_t(\LLambda)\|\|\xx_0 - \xx^\star\| \quad \text{ (eigendecomposition)}
\end{align}
Finally, since $P_t(\LLambda)$ is a diagonal matrix, its operator norm is given by its largest
element. This leads to the following bound:
</p>
<p class="corollary" text="Convergence rate"> Let $\lmin$ and $L$ be the smallest and largest
eigenvalue of $\HH$ respectively. Then for any gradient-based method with residual polynomial
$P_t$, we have
\begin{equation}
\|\xx_t - \xx^\star\| \leq {\color{purple}\underbrace{\max_{\lambda \in [\lmin, L]
}}_{\text{conditioning}}} \,
{\color{teal}\underbrace{\vphantom{\max_{[L]}}|P_t(\lambda)|}_{\text{algorithm}} } \,\,
{\color{brown}\underbrace{\vphantom{\max_{[L]}}\|\xx_0 - \xx^\star\|}_{\text{initialization}}}~.
\end{equation}
</p>
<p>
This Corollary will be very useful in the following as it constructs a bound on the distance to
optimum based on the three aspects that most influence convergence:
<ul>
<li>
The <span style="color: teal">algorithm</span> enters this bound through its residual
polynomial. The smaller the image of the residual polynomial, the better. Of course, we have the
constraint $P_t(0) = 1$, which makes choosing this polynomial non-trivial.
</li>
<li>The <span style="color: purple">conditioning of the problem</span> enters through the
eigenvalue interval $[\lmin, L]$.
<!-- The further apart $\lmin$ from $L$, the harder the problem. -->
</li>
<li>
The <span style="color: brown">initialization</span> enters through its distance to optimum.
</li>
</ul>
</p>
<p>
Of the quantities in the above lemma, $\max_{\lambda \in [\lmin, L]} |P_t(\lambda)|$ is an upper bound on the progress that the algorithm makes at iteration $t$. This quantity will be useful in the
following to compare the convergence properties of different algorithms. We will refer to it as
the (worst-case) <i>convergence rate</i> of a method.
</p>
<p class="remark" text="Worst-case vs average-case">
All the convergence rates in this post are worst-case bounds. By this I mean that the bounds
hold for <i>any</i> problem within a class (quadratic problems of the form \eqref{eq:opt} in
this post). However, for certain problems the empirical convergence often is better than the
theoretical bound. Furthermore, these bounds don't say anything about the <span style="font-style: normal">typical</span> performance of a method, which is more
representative of the actual performance of a method. In a recent work, <a href="https://damienscieur.com/">Damien</a> and I explored the average-case complexity of
optimization algorithms, first in the non-asymptotic regime<dt-cite key="pedregosa2020average">
</dt-cite> and then in the asymptotic one.<dt-cite key="scieur2020universal"></dt-cite> The
average-case analysis of optimization algorithms will be the subject of a future post.
</p>
<h2>Optimal Residual Polynomial</h2>
<p>
In the previous section we've seen that the convergence rate of gradient descent is
$\left(\frac{\lmax - \lmin}{\lmax + \lmin} \right)^{t}$. <i>Is this the optimal convergence rate
or can we do better?</i>
</p>
<p>
To answer this question we will use a somewhat indirect approach. Instead of directly seeking
the optimal method, we will seek the optimal <i>residual polynomial</i>, and then
reverse-engineer the method from its polynomial.
This technique turns out to be more appropriate since it will allow us to use powerful results from the polynomial approximation literature.
</p>
<p>
By optimal in this case I mean the polynomial with smallest convergence rate $\max_{\lambda \in [\lmin, L]}|P(\lambda)|$. This corresponds
to solving the following problem over the space of polynomials
\begin{equation}
\argmin_{P}\max_{\lambda \in [\lmin, L]}|P(\lambda)|~,
\end{equation}
where the minimization is over all degree $t$ residual polynomials.
The normalization $P(0) = 1$ of residual polynomials makes this problem non-trivial, as
otherwise the optimal solution would be $P(\lambda) = 0$.
</p>
<p>
The residual polynomial that minimizes the above expression turns out to be intimately related
to <a href="https://en.wikipedia.org/wiki/Chebyshev_polynomials">Chebyshev polynomials</a>,
which are one of the most used families of polynomials in numerical analysis.<dt-cite key="chebyshev1853theorie"></dt-cite>
<dt-note><a href="https://en.wikipedia.org/wiki/Pafnuty_Chebyshev#/media/File:Pafnuty_Lvovich_Chebyshev.jpg"><img style="display: block; margin: 0 auto; max-width: 200px; box-shadow: 6px 6px 3px grey;" src="https://upload.wikimedia.org/wikipedia/commons/d/d3/Pafnuty_Lvovich_Chebyshev.jpg" alt=""></a> <br> Chebyshev polynomials bear the name of <a href="https://en.wikipedia.org/wiki/Pafnuty_Chebyshev">Pafnuty Chebyshev</a> (1821-1894). He
is considered to be a founding father of Russian mathematics and made important contributions to the fields of probability (<a href="https://en.wikipedia.org/wiki/Chebyshev%27s_inequality">Chebyshev inequality</a>),
statistics, mechanics and number theory (<a href="https://en.wikipedia.org/wiki/Bertrand%27s_postulate">Bertrand-Chebyshev theorem</a>) among others.
</dt-note>
</p>
<p id="def-chebyshev" class="definition" text="Chebyshev polynomials"> The Chebyshev polynomials of the first kind
$T_0, T_1, \ldots$ are defined by the recurrence relation
\begin{align}
&T_0(\lambda) = 1 \qquad T_1(\lambda) = \lambda\\
&T_{k+1}(\lambda) = 2 \lambda T_k(\lambda) - T_{k-1}(\lambda)~.
\end{align}
</p>
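<p>A quick way to check this recurrence (my own sketch, not from the post) is against the classical trigonometric identity $T_t(\cos\theta) = \cos(t\theta)$, which holds on $[-1, 1]$:</p>

```python
# Check the three-term recurrence against T_t(cos(theta)) = cos(t*theta).
import numpy as np

def chebyshev_T(t, lam):
    """Evaluate T_t(lam) via the three-term recurrence."""
    T_prev, T_curr = np.ones_like(lam), lam
    if t == 0:
        return T_prev
    for _ in range(t - 1):
        T_prev, T_curr = T_curr, 2 * lam * T_curr - T_prev
    return T_curr

theta = np.linspace(0, np.pi, 50)
for t in range(6):
    assert np.allclose(chebyshev_T(t, np.cos(theta)), np.cos(t * theta))
```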
<p>
Among the many properties of Chebyshev polynomials is that a shifted and normalized version of
them achieves the optimal convergence rate.
Most best polynomial approximation results assume the polynomials are defined in the interval
$[-1, 1]$. However, in our case it's more natural to focus on the interval $[\lmin, L]$ that contains the Hessian eigenvalues. To account for this discrepancy, we'll use the following affine mapping
\begin{equation}
\sigma(\lambda) = \frac{L+\lmin}{L-\lmin} - \frac{2}{L - \lmin}\lambda\,,
\end{equation}
that maps the interval $[\lmin, L]$ onto $[-1, 1]$ with $\sigma(\lmin) = 1$ and $\sigma(L) = -1$.
</p>
<p>
We can now introduce a theorem that relates the convergence rate optimality to
Chebyshev polynomials. It was first proven by Vladimir Markov<dt-cite key="markoff1916polynome">
</dt-cite>
<dt-note><a href="https://en.wikipedia.org/wiki/Vladimir_Markov_(mathematician)">Vladimir
Markov</a> was a student of Chebyshev and brother of the famous <a href="https://en.wikipedia.org/wiki/Andrey_Markov">Andrey Markov</a>. Vladimir died of
tuberculosis when he was only 25 years old.</dt-note> and later rediscovered by Flanders and
Shortley.<dt-cite key="flanders1950numerical"></dt-cite>
</p>
<p class="theorem" text="Markov, 1916">
The following shifted Chebyshev polynomial has smallest maximum absolute value through the
interval
$[\lmin, \lmax]$ (where $\lmin > 0$) among all residual polynomials of degree $t$:
\begin{equation}\label{eq:def_residual_chebyshev}
P^{\text{cheb}}_t(\lambda) \defas \frac{T_t(\sigma(\lambda))}{T_t(\sigma(0))}~.
\end{equation}
We will refer to this polynomial as the <i>residual Chebyshev polynomial.</i>
</p>
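<p>The theorem can be probed numerically. The sketch below (mine; the values of $\lmin$, $L$ and $t$ are arbitrary choices) evaluates both residual polynomials on a grid over $[\lmin, L]$ and confirms that the shifted Chebyshev polynomial has the smaller maximum absolute value:</p>

```python
# Compare the worst-case rates max |P_t| over [mu, L] of the gradient
# descent and shifted Chebyshev residual polynomials.
import numpy as np

mu, L, t = 0.1, 1.0, 10
lam = np.linspace(mu, L, 2001)
sigma = (L + mu) / (L - mu) - 2.0 / (L - mu) * lam   # maps [mu, L] -> [1, -1]

def cheb_T(t, x):
    """Evaluate T_t(x) via the three-term recurrence."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    T_prev, T_curr = np.ones_like(x), x.copy()
    if t == 0:
        return T_prev
    for _ in range(t - 1):
        T_prev, T_curr = T_curr, 2 * x * T_curr - T_prev
    return T_curr

P_gd = (1 - 2.0 / (L + mu) * lam) ** t
P_cheb = cheb_T(t, sigma) / cheb_T(t, (L + mu) / (L - mu))   # residual Chebyshev
assert np.max(np.abs(P_cheb)) < np.max(np.abs(P_gd))         # Chebyshev wins
```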
<div class="proof">
<p>
It's <a href="https://en.wikipedia.org/wiki/Chebyshev_polynomials#Roots_and_extrema">known</a>
that the Chebyshev polynomial $T_t$ has $t+1$ extreme points in the interval $[-1, 1]$, and
the image of these extremal points is alternately positive and negative.
By the definition of $P_t$ above, its numerator is $T_t$ composed with an affine map
between $[\lmin, L]$ and $[-1, 1]$, and so $P_t$ reaches its extreme value $t+1$
times in the interval $[\lmin, L]$, and again the image at these extremal points is alternately
positive and negative. Now suppose that $R_t$ is a residual polynomial with the same degree as
$P_t$ but with smaller maximum absolute value and let $Q(\lambda) \defas P_t(\lambda) -
R_t(\lambda)$.
</p>
<p>
Since by assumption $R_t$ has smaller maximum absolute value than $P_t$, $Q$ is alternately $>
0$ and $< 0$ at the extremal points of $P_t$. From the intermediate value theorem, this
implies that $Q$ has $t$ zeros in the interval $[\lmin, L]$. However, since both $P_t$ and
$R_t$ are residual polynomials, we also have $Q(0)=P_t(0) - R_t(0)=0$ and so $Q$ would have
$t+1$ zeros in the interval $[0, L]$. Since $Q$ is a polynomial of degree $t$ and non-zero
by assumption, this cannot be true. We reached a contradiction induced by the assumption
that $R_t$ had smaller maximum absolute value, which proves the theorem. </p>
</div>
<p>
As we did for gradient descent, I find it interesting to visualize the residual polynomials.
Below is the Chebyshev and gradient descent
residual polynomial, together with their maximum absolute value ($\equiv$convergence rate) shown
in dashed
lines.
</p>
<figure>
<span class="marginnote">Gradient descent and Chebyshev residual polynomials <br><br>
<a href="https://colab.research.google.com/gist/fabianp/98b052553d5fc50c7c2d099360bb2df5/polynomials_acceleration.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></span>
<img src="/images/2020/cheb_residual_polynomial.png">
</figure>
<p>
As expected from the theory, the maximum of the Chebyshev polynomial (dashed orange line) is
significantly smaller than that of the gradient descent residual polynomial, which translates
into a faster convergence rate. We can also see from this plot that the Chebyshev residual
polynomial, unlike that of gradient descent, reaches its extremal value not just at the edges, but
$t+1$ times, a fact that was a crucial argument in proving Markov's Theorem.
</p>
<h2>The Chebyshev iterative method</h2>
<p>
From the previous theorem we know that the Chebyshev residual polynomials are optimal in terms
of their worst-case convergence rate. We will now derive the method associated with this
polynomial. We will do this by deriving the recurrence of the residual polynomial, and then
using the definition of residual polynomial to match the method's coefficients.
</p>
<p>
Let $a = \sigma(0), b = \sigma(\lambda)$. Then using the three term recurrence of the Chebyshev
polynomials we have
\begin{align}
P^{\text{cheb}}_{t+1}(\lambda) &= \frac{T_{t+1}(b)}{T_{t+1}(a)} = 2 b \frac{T_t(b)}{T_{t+1}(a)}
- \frac{T_{t-1}(b)}{T_{t+1}(a)}\\
&= 2 b \frac{T_t(a)}{T_{t+1}(a)} P^{\text{cheb}}_t(\lambda) - \frac{T_{t-1}(a)}{T_{t+1}(a)}
P^{\text{cheb}}_{t-1}(\lambda)~,
\end{align}
where in the second identity we have multiplied and divided the first term by $T_t(a)$ and the
second term by $T_{t-1}(a)$.
</p>
<p>
Now let $\omega_{t} \defas 2 a\frac{T_t(a)}{ T_{t+1}(a)}$. Then we can rewrite the previous
equation as
\begin{equation}
P^{\text{cheb}}_{t+1}(\lambda) = \omega_t \frac{b}{a} P^{\text{cheb}}_t(\lambda) - \frac{1}{4
a^2} \omega_t \omega_{t-1} P^{\text{cheb}}_{t-1}(\lambda)\,.
\end{equation}
We would now like to have a cheap way to compute $\omega_t$. Using again the three term recurrence of
Chebyshev polynomials, this time on $\omega_{t}^{-1}$, we have
\begin{align}
\omega_{t}^{-1} &= \frac{T_{t+1}(a)}{2 a T_{t}(a)} = \frac{2 a T_t(a) - T_{t-1}(a)}{2 a
T_{t}(a)} = 1 - \frac{\omega_{t-1}}{4 a^2}
\end{align}
and so the full recursion for the residual polynomial becomes
\begin{align}
&P_0(\lambda) = 1\,,~P_1(\lambda) = 1 - \tfrac{2}{L + \lmin}\lambda\,\\
&\omega_{t} = \left(1 - \tfrac{1}{4 a^2}\omega_{t-1} \right)^{-1} \quad \text{ (with $\omega_0 =
2$)}\\
&P^{\text{cheb}}_{t+1}(\lambda) = \omega_{t} \tfrac{b}{a} P^{\text{cheb}}_t(\lambda) -
\tfrac{1}{4 a^2} \omega_{t} \omega_{t-1} P^{\text{cheb}}_{t-1}(\lambda)
\end{align}
where $\omega_0 = 2$ is computed from the definition of Chebyshev polynomials, $\omega_{0} = 2a T_{0}(a) /
T_1(a) = 2 a \cdot 1 / a = 2$.
</p>
<p>
This gives a recurrence for the residual polynomial.
We can then use the connection between optimization methods and residual polynomials
\eqref{eq:def_residual_polynomial}, this time in reverse, to find the coefficients of the
optimization method from its residual polynomial. Matching terms in $P_t$ in the previous
equation we have
\begin{equation}
1 + \lambda c^{(t)}_{t} + c^{(t)}_{t-1} = \omega_{t}\tfrac{b}{a} = \omega_{t}(1 - \tfrac{2}{L +
\lmin}\lambda)
\end{equation}
and so matching terms in $\lambda$ we get
\begin{equation}
c^{(t)}_{t} = - \tfrac{2}{L+\lmin} \omega_{t} \qquad c^{(t)}_{t-1} = \omega_{t} - 1~.
\end{equation}
Replacing these parameters in the definition of a gradient-based method, we obtain the Chebyshev iterative method:
</p>
</div>
<p class="framed">
<b class="tufte-underline">Chebyshev Iterative Method</b><br>
<b>Input</b>: starting guess $\xx_0$, $\rho = \frac{L - \lmin}{L + \lmin}$, $\omega_{0} = 2$<br>
$\xx_1 = \xx_0 - \frac{2}{L + \lmin} \nabla f(\xx_0)$ <br>
<b>For</b> $t=1, 2, \ldots~$ <b>do</b>
\begin{align}
\omega_{t} &= (1 - \tfrac{\rho^2}{4}\omega_{t-1})^{-1}\label{eq:recurrence_momentum}\\
\xx_{t+1} &= \vphantom{\sum^n}\xx_t + \left(\omega_{t} - 1\right)(\xx_{t} -
\xx_{t-1})\nonumber\\
&\quad - \omega_{t}\tfrac{2}{L + \lmin}\nabla f(\xx_t)
\end{align}
</p>
<p class="remark">
Although we initially considered a class of methods that can use <span style="font-style: normal;">all</span> previous iterates, the optimal method only requires storing
the last iterate and the difference vector $\xx_{t} - \xx_{t-1}$. This makes the method
much more practical than if we had to store all previous iterates.
</p>
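<p>As a sanity check, here is a minimal Python sketch of the Chebyshev iterative method above (the test problem is a random quadratic of my choosing), compared against gradient descent with the same step size:</p>

```python
# Minimal implementation of the Chebyshev iterative method for
# f(x) = 0.5 x'Hx + b'x, compared against gradient descent.
import numpy as np

def chebyshev_method(H, b, x0, L, mu, n_iter):
    rho = (L - mu) / (L + mu)
    x_prev = x0
    x = x0 - 2.0 / (L + mu) * (H @ x0 + b)   # first step is a gradient step
    omega = 2.0                               # omega_0
    for _ in range(n_iter - 1):
        omega = 1.0 / (1.0 - rho ** 2 / 4.0 * omega)
        grad = H @ x + b
        x, x_prev = (x + (omega - 1.0) * (x - x_prev)
                     - omega * 2.0 / (L + mu) * grad), x
    return x

rng = np.random.default_rng(0)
d = 50
A = rng.standard_normal((d, d))
H = A @ A.T + 0.1 * np.eye(d)
b = rng.standard_normal(d)
eigs = np.linalg.eigvalsh(H)
mu, L = eigs[0], eigs[-1]
x_star = np.linalg.solve(H, -b)

x0 = np.zeros(d)
x_cheb = chebyshev_method(H, b, x0, L, mu, 50)
x_gd = x0
for _ in range(50):                           # gradient descent baseline
    x_gd = x_gd - 2.0 / (L + mu) * (H @ x_gd + b)
assert np.linalg.norm(x_cheb - x_star) < np.linalg.norm(x_gd - x_star)
```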
<h2 id="convergence-rate">Convergence rate</h2>
<p>
By construction, the Chebyshev iterative method has an optimal worst-case convergence rate. But
what is this optimal rate?
</p>
<p>
Since the maximum absolute value of the Chebyshev polynomials in $[-1, 1]$ is 1, the convergence
rate for $P^{\text{cheb}}$ in this case is
\begin{equation}
\max_{\lambda \in [\lmin, L]} |P^{\text{cheb}}_t(\lambda)| = T_t\left(\frac{L + \lmin}{L -
\lmin}\right)^{-1}~.
\end{equation}
</p>
<p>
This is however not very helpful if we cannot evaluate the Chebyshev polynomial. Luckily,
outside the interval $[-1, 1]$ they do admit the following <a href="https://en.wikipedia.org/wiki/Chebyshev_polynomials#Explicit_expressions">explicit
expression</a>
\begin{equation}\label{eq:cheb_explicit}
T_t(\lambda) = \frac{1}{2} \big[ \big( \lambda + \sqrt{\lambda^2-1} \big)^t + \big(\lambda -
\sqrt{\lambda^2-1}\big)^t \big]\,.
\end{equation}
Setting $\lambda = \tfrac{L + \lmin}{L - \lmin}$ and using the trivial bound $\lambda >
\sqrt{\lambda^2-1}$ we have
\begin{align}
T_t\left(\tfrac{L+\lmin}{L - \lmin}\right)
&\geq \frac{1}{2}\left( \tfrac{L+\lmin}{L - \lmin} + \sqrt{\Big( \tfrac{L+\lmin}{L -
\lmin}\Big)^2-1} \right)^t \label{eq:bound_chebyshev}\\
&= \frac{1}{2}\left(\tfrac{L + \lmin + 2 \sqrt{\lmin L}}{L - \lmin} \right)^t \\
&= \frac{1}{2}\left(\tfrac{\sqrt{L} + \sqrt{\lmin}}{\sqrt{L} - \sqrt{\lmin}} \right)^t ~,
\end{align}
where the last identity follows from completing the square in the numerator and using the identity
$L - \lmin = (\sqrt{L} - \sqrt{\lmin})(\sqrt{L} + \sqrt{\lmin})$ in the denominator. We have hence
computed the convergence rate associated with the Chebyshev residual polynomial:
</p>
<p class="corollary">
Let $\xx_1, \xx_2, \ldots$ be the iterates generated by the Chebyshev iterative method. Then we
have the following bound
\begin{equation}
\|\xx_t - \xx^\star\| \leq 2\left( \frac{\sqrt{L} - \sqrt{\lmin}}{\sqrt{L} + \sqrt{\lmin}}\right)^t
\|\xx_0 - \xx^\star\| \,.
\end{equation}
</p>
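<p>The corollary's constant can also be verified numerically. The sketch below (my own illustration with arbitrary values of $\lmin$ and $L$) compares the exact rate $T_t\big(\tfrac{L+\lmin}{L-\lmin}\big)^{-1}$ with the closed-form upper bound:</p>

```python
# Check that 1/T_t((L+mu)/(L-mu)) is below the closed-form bound
# 2 * ((sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu)))**t for all t.
import numpy as np

def cheb_T_scalar(t, x):
    """Evaluate T_t(x) at a scalar via the three-term recurrence."""
    T_prev, T_curr = 1.0, x
    if t == 0:
        return T_prev
    for _ in range(t - 1):
        T_prev, T_curr = T_curr, 2 * x * T_curr - T_prev
    return T_curr

mu, L = 0.05, 1.0
for t in range(1, 30):
    exact = 1.0 / cheb_T_scalar(t, (L + mu) / (L - mu))
    bound = 2 * ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** t
    assert exact <= bound
```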
<p>
Note that the term $2\left( \frac{\sqrt{L} -
\sqrt{\lmin}}{\sqrt{L} + \sqrt{\lmin}}\right)^t$ is not the exact rate $\max_{\lambda \in [\lmin, L]} |P^{\text{cheb}}_t(\lambda)|$ but instead an upper bound on this quantity due to the bound we
used in \eqref{eq:bound_chebyshev}. This explains why, for some choices of $L$, $\lmin$ and $t$,
this bound is worse than that of gradient descent, even though the Chebyshev method has an
optimal convergence rate.
</p>
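To get a feel for the magnitude of this improvement, the following snippet (an illustrative sketch, not part of the original post) compares the per-iteration contraction factor of gradient descent with step size $2/(L + \lmin)$, which is $(L - \lmin)/(L + \lmin)$, against the factor $(\sqrt{L} - \sqrt{\lmin})/(\sqrt{L} + \sqrt{\lmin})$ appearing in the Chebyshev bound above:

```python
import numpy as np

L, l_min = 1.0, 1e-2  # condition number L / l_min = 100

# contraction factor of gradient descent with step size 2 / (L + l_min)
rate_gd = (L - l_min) / (L + l_min)
# contraction factor appearing in the Chebyshev bound above
rate_cheb = (np.sqrt(L) - np.sqrt(l_min)) / (np.sqrt(L) + np.sqrt(l_min))

# iterations needed to shrink the initial error by a factor of 1e-6
iters_gd = np.log(1e-6) / np.log(rate_gd)
iters_cheb = np.log(1e-6) / np.log(rate_cheb)
print(f"gradient descent: factor {rate_gd:.4f}, ~{iters_gd:.0f} iterations")
print(f"Chebyshev bound:  factor {rate_cheb:.4f}, ~{iters_cheb:.0f} iterations")
```

On this example the Chebyshev bound requires roughly $\sqrt{\kappa} = 10$ times fewer iterations, which is the classical acceleration factor.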
<p>
Despite the looseness of the bound, the square roots on $L$ and $\lmin$ can make the
convergence rate much smaller, especially when $\lmin$ is much smaller than $L$. This is a
common setting, since in machine learning $\lmin$ is often very close to zero. Below is a
comparison of the convergence rate bounds of gradient descent and the Chebyshev method.
</p>
<figure>
<span class="marginnote">Comparison of convergence rate bounds for gradient descent and
the Chebyshev method.<br />
<a href="https://colab.research.google.com/gist/fabianp/98b052553d5fc50c7c2d099360bb2df5/polynomials_acceleration.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></span>
<img src="/images/2020/convergence_rate.png">
</figure>
<h2>Citing</h2>
<p>
If you find this blog post useful, please consider citing it as:
</p>
<p style="margin-left: 5%">
<a href="http://fa.bianp.net/blog/2020/polyopt/">Residual Polynomials and the Chebyshev method.</a>, Fabian Pedregosa, 2020
</p>
<p>
Bibtex entry:
</p>
<pre>
<code>
@misc{pedregosa2021residual,
title={Residual Polynomials and the Chebyshev method},
author={Pedregosa, Fabian},
howpublished = {\url{http://fa.bianp.net/blog/2020/polyopt/}},
year={2020}
}
</code>
</pre>
<h2>Conclusion</h2>
<p>
In this post we've seen how to assign a polynomial to any gradient-based optimization
method, how to use this polynomial to obtain convergence guarantees and how to use it to
derive optimal methods. Using this framework we've derived the Chebyshev iterative method.
</p>
<p>
In the next posts, I will relate Polyak momentum (aka HeavyBall) to this framework.
</p>
<p>
<b>Thanks</b> to <a href="https://scholar.google.com/citations?user=93PAG2AAAAAJ&hl=en">Baptiste Goujaud</a> for detailed feedback and reporting many typos, <a href="https://damienscieur.com/">Damien Scieur</a> —polynomial wizard
and amazing collaborator—,
<a href="http://nicolas.le-roux.name/">Nicolas
Le Roux</a>, <a href="https://cypaquette.github.io/">Courtney Paquette</a> and <a href="https://www.di.ens.fr/~ataylor/">Adrien Taylor</a> —for
encouragement and stimulating discussions—, <a href="http://francisbach.com/">Francis
Bach</a> —for pointers and <a href="http://francisbach.com/">great
writings</a> on this topic—, <a href="https://twitter.com/alirahimi0">Ali Rahimi</a> for feedback on this post, <a href="https://scholar.google.co.uk/citations?user=mvDmzAQAAAAJ&hl=en">Nicolas Loizou</a>, and
<a href="https://scholar.google.fr/citations?user=oXxTTe8AAAAJ&hl=fr">Waiss Azizian</a> for
discussions.
</p>
<p>
<b>Updates.</b> [2020/02/06] <a href="https://www.reddit.com/r/math/comments/fxu11e/on_the_link_between_polynomials_and_optimization/?utm_source=share&utm_medium=web2x">Reddit discussion</a>.
</p>
<hr />
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
How to Evaluate the Logistic Loss and not NaN trying2019-09-27T00:00:00+02:002019-09-27T00:00:00+02:00<a href='http://fa.bianp.net/pages/about.html'>Fabian Pedregosa</a>tag:fa.bianp.net,2019-09-27:/blog/2019/evaluate_logistic/
<p>A naive implementation of the logistic regression loss can result in numerical indeterminacy even for moderate values. This post takes a closer look into the source of these instabilities and discusses more robust Python implementations.
</p>
<!-- for highlighting -->
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.15.5/styles/default.min.css">
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.15.5/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<!-- Mathjax-->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath …</script>
<p>A naive implementation of the logistic regression loss can result in numerical indeterminacy even for moderate values. This post takes a closer look into the source of these instabilities and discusses more robust Python implementations.
</p>
<!-- for highlighting -->
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.15.5/styles/default.min.css">
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.15.5/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<!-- Mathjax-->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js">
</script>
<!-- for references -->
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@article{machler2012accurately,
title={Accurately Computing log(1- exp(-|a|)) Assessed by the Rmpfr package},
author={Mächler, Martin},
journal={The Comprehensive R Archive Network},
year={2012},
url={https://cran.r-project.org/web/packages/Rmpfr/vignettes/log1mexp-note.pdf}
}
@article{10.1093/imanum/draa038,
author = {Blanchard, Pierre and Higham, Desmond J and Higham, Nicholas J},
title = {Accurately computing the log-sum-exp and softmax functions},
journal = {IMA Journal of Numerical Analysis},
year = {2020},
month = {08},
issn = {0272-4979},
doi = {10.1093/imanum/draa038},
url = {https://doi.org/10.1093/imanum/draa038},
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
document.addEventListener('DOMContentLoaded', doTOC, false);
</script>
<div id="TOC"></div>
<div style="display: none">
$$
\def\aa{\boldsymbol a}
\def\bb{\boldsymbol b}
\def\dd{\boldsymbol d}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\def\defas{\stackrel{\text{def}}{=}}
$$
</div>
<h2>Logistic regression</h2>
<p>Consider the logistic regression loss, defined as
\begin{equation}\label{eq:logloss}
f(\xx) \defas \frac{1}{n}\sum_{i=1}^n - b_i \log(s(\aa_i^\intercal \xx)) - (1 - b_i) \log(1 - s(\aa_i^\intercal \xx))~,
\end{equation}
where each $\aa_i$ is a $d$-dimensional vector, $b_i$ is a scalar between 0 and 1,
and $s(t) \defas 1 / (1 + \exp(-t))$ is the <a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid function</a>. My goal in this post will be to derive Python code that can evaluate this function as precisely as possible without sacrificing speed.<dt-note>Some references use the following slightly different formulation
\begin{equation}
\label{eq:logloss2}
\frac{1}{n}\sum_{i=1}^n -\log(s(c_i\, \aa_i^\intercal \xx))~,
\end{equation}
where $c_i$ is a scalar that takes the value $-1$ when $b_i = 0$ and $1$ when $b_i = 1$.
That both formulations are equivalent is not obvious but can be proven by a distinction of cases on $b_i$.
This is the formulation that appears for example in the <a href="https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression">scikit-learn documentation</a>.
</dt-note> This function has <a href="https://en.wikipedia.org/wiki/Logistic_regression">many interpretations</a> that we will not go into here, as my main focus will be the numerical issues that arise when evaluating this expression.
</p>
<p>
Below is a straightforward implementation of the previous equation, where $A$ is the row-wise stacking of $\aa_i$:
</p>
<pre><code class="python">import numpy as np
def f_naive(x, A, b):
"""Logistic loss, naive implementation."""
z = np.dot(A, x)
tmp = 1. / (1 + np.exp(-z))
return np.mean(- b * np.log(tmp) - (1 - b) * np.log(1 - tmp))
</code></pre>
<p>
It turns out that this simple and innocent-looking implementation will break down for moderately large values of the input.
This can be a real problem for example when training a machine learning model that uses this loss function.<dt-note>Luckily, most machine learning libraries provide more sophisticated implementations that safeguard against this indeterminacy. I (Fabian) first saw this trick in the <a href="http://www.jmlr.org/papers/volume9/fan08a/fan08a.pdf">liblinear package</a> around 2010, although it was likely known and used even before. In this library, the <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/src/liblinear/linear.cpp#L113">implementation of the logistic loss</a> contains a distinction of cases to safeguard against the indeterminacy. </dt-note>
For example, if we try the above code with inputs (say) $A = [[1, 1]], b=[1]$ and $x=[20, 20]$ we get the dreadful Not-A-Number (NaN):
</p>
<pre><code class="python">>>> A = [[1, 1]]
>>> b = [1]
>>> f_naive([20, 20], A, b)
RuntimeWarning: divide by zero encountered in log
RuntimeWarning: invalid value encountered in multiply
nan
</code></pre>
<p>
This happens because the two terms in 1 + np.exp(-z) are on very different scales. exp(-40) is of the order of $10^{-18}$. Since double precision can hold up to 17 significant decimal digits, the expression 1 + np.exp(-40) gets truncated to 1, a phenomenon known as <a href="https://en.wikipedia.org/wiki/Round-off_error#Roundoff_error_caused_by_floating-point_arithmetic">round-off error</a> in numerical analysis.
</p>
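This truncation is easy to reproduce directly (illustrative snippet, not part of the original post):

```python
import numpy as np

# exp(-40) is ~4.2e-18, far below the ~1e-16 resolution of a float64 near 1,
# so adding it to 1 leaves 1 unchanged ...
print(1. + np.exp(-40) == 1.)  # the small term is absorbed
# ... and the subsequent log returns exactly 0 instead of ~-4.2e-18
print(np.log(1. + np.exp(-40)))
# np.log1p avoids forming the intermediate 1 + x and preserves the small term
print(np.log1p(np.exp(-40)))
```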
<p>
A second issue is that the exponential function overflows for large values of the input. For double precision types (float64 in <a href="https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html">NumPy jargon</a>), $e^t$ overflows for values $t \geq 710$, while for single precision types (float32) this already happens for $t \geq 89$.
</p>
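These thresholds can be checked directly (illustrative snippet, with the overflow warning silenced):

```python
import numpy as np

with np.errstate(over='ignore'):
    # float64: exp overflows just above t = 709.78
    ok64 = np.exp(np.float64(709.0))
    bad64 = np.exp(np.float64(710.0))
    # float32: exp overflows just below t = 89
    ok32 = np.exp(np.float32(88.0))
    bad32 = np.exp(np.float32(89.0))
print(np.isinf(ok64), np.isinf(bad64))  # finite, then inf
print(np.isinf(ok32), np.isinf(bad32))  # finite, then inf
```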
<h2>A closer look at the log-sigmoid</h2>
<p>
<span class="marginnote">
<img src="/images/2019/log_sigmoid.svg" alt="">
Log-sigmoid function in the interval [-5, 5].
</span>
The issues encountered before happen during the evaluation of the log-sigmoid function $\log(s(\cdot))$. Therefore we take a closer look into this function and examine its accuracy.
We will compare the following 3 different implementations:
</p>
<p>
1. <b>naive</b>. Directly evaluating $\log(s(t))$ as in the previous implementation. In code, this corresponds to
</p>
<pre><code class="python">import numpy as np
def logsig_naive(t):
return np.log(1 / (1 + np.exp(-t)))
</code></pre>
<p>
2. <b>logsumexp</b>. Scipy's <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.logsumexp.html#scipy.special.logsumexp">logsumexp</a> computes the expression
\begin{equation}
\varphi(\bb) = \log\left(\sum^{d}_{i=1} \exp(b_i)\right)~.
\end{equation}
We can reuse this function to compute the log-sigmoid through the identity
\begin{align}
\log(s(t)) &= \log(1/(1 + \exp(-t)))\\
&= -\log(1 + \exp(-t))\\
&= - \varphi([0, -t])~.
\end{align}
This suggests the following implementation:
</p>
<pre><code class="python">import numpy as np
from scipy import special
def logsig_logsumexp(t):
return - special.logsumexp([0, -t])
</code></pre>
<p>
3. <b>log1pexp</b>. <a href="https://stat.ethz.ch/~maechler/">Martin Mächler</a> wrote a note in 2012 on how to compute accurately the log-sigmoid function in the R language.<dt-cite key="machler2012accurately"></dt-cite> His technique is based on using different approximations for different values of the input and uses the <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html">np.log1p</a> function, which computes $\log(1 + x)$ for small values of $x$.
This implementation of the log-sigmoid function is
\begin{equation}
\log(s(t)) = \begin{cases}
t & t < -33.3\\
t - \exp(t) & -33.3 \leq t \leq -18 \\
-\text{log1p}(\exp(-t)) & -18 \leq t \leq 37 \\
-\exp(-t) & 37 \leq t\\
\end{cases}~.
\end{equation}
A Python implementation of this could be
</p>
<pre><code class="python">import numpy as np
def logsig_log1pexp(t):
if t < -33.3:
return t
elif t <= -18:
return t - np.exp(t)
elif t <= 37:
return -np.log1p(np.exp(-t))
else:
return -np.exp(-t)
</code></pre>
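Before the systematic comparison, a single extreme input already separates these implementations (illustrative check using the functions defined above, with warnings silenced):

```python
import numpy as np
from scipy import special

def logsig_naive(t):
    return np.log(1 / (1 + np.exp(-t)))

def logsig_logsumexp(t):
    return -special.logsumexp([0, -t])

def logsig_log1pexp(t):
    if t < -33.3:
        return t
    elif t <= -18:
        return t - np.exp(t)
    elif t <= 37:
        return -np.log1p(np.exp(-t))
    else:
        return -np.exp(-t)

with np.errstate(over='ignore', divide='ignore'):
    # exp(1000) overflows to inf, so the naive version collapses to log(0)
    naive_val = logsig_naive(-1000.)
print(naive_val)                 # -inf
print(logsig_logsumexp(-1000.))  # -1000.0 (logsumexp avoids the overflow)
print(logsig_log1pexp(-1000.))   # -1000.0
```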
<p>
We now would like to compare the different methods. For this, we look at the relative error
\begin{equation}
\frac{y - \widehat{y}}{y}~,
\end{equation}
where $\widehat{y}$ is the value computed by the method we want to evaluate and $y$ is the true value, computed in this case using <a href="http://mpmath.org/">mpmath</a>, a library for arbitrary precision arithmetic.
The plot below shows this relative accuracy (lower is better) for the previous methods and over a wide range of inputs:
</p>
<figure class="fullwidth">
<img src="/images/2019/precision_logloss.svg" alt="">
</figure>
<p>
<span class="marginnote">
<b>Relative precision for the logistic loss computation</b> (lower is better). While most methods do reasonably well for negative values, the precision quickly deteriorates in the positive regime. Values above 1 indicate that the routine gave NaN.
<br />
<a href="https://colab.research.google.com/gist/fabianp/2e3fbd3cc9046a87de59fff1603bbe5b/logistic_regression.ipynb">
<img src="/images/2019/colab-badge.svg" alt="Open In Colab" /></a>
</span>
</p>
<p>
This plot shows that log1pexp indeed has a much greater accuracy than the alternative implementations. It also shows the poor accuracy of logsumexp in the positive domain. We found this surprising, since we expected scipy's logsumexp to provide a more accurate evaluation of the log-sum-exp expression. While it's true that it does not suffer from overflow, beyond that it is no more accurate than the naive implementation.
</p>
<p>
<b>TL;DR:</b> Use log1pexp. The naive implementation overflows and has poor accuracy; logsumexp also yields poor accuracy in the positive domain. log1pexp suffers from these issues the least.
</p>
<h2>A more stable implementation of the logistic loss</h2>
<p>
With the results from the previous section we can now write a more stable version of the logistic loss.
</p>
<p>
One last trick that we will use is to use the formula $\log(1 - s(z)) = -z + \log(s(z))$ to simplify \eqref{eq:logloss} slightly so that it becomes
\begin{equation}\label{eq:logistic2}
f(\xx) = \frac{1}{n}\sum_{i=1}^n (1 - b_i) \aa_i^\intercal \xx - \log(s(\aa_i^\intercal \xx))~.
\end{equation}
</p>
<p>
And finally, here is the full Python implementation. The routine logsig is slightly more convoluted than the ones presented before because it applies the previous case distinction component-wise, without for loops, for efficiency on large arrays:
</p>
<pre><code class="python">import numpy as np
from scipy import special
def logsig(x):
"""Compute the log-sigmoid function component-wise."""
out = np.zeros_like(x)
idx0 = x < -33
out[idx0] = x[idx0]
idx1 = (x >= -33) & (x < -18)
out[idx1] = x[idx1] - np.exp(x[idx1])
idx2 = (x >= -18) & (x < 37)
out[idx2] = -np.log1p(np.exp(-x[idx2]))
idx3 = x >= 37
out[idx3] = -np.exp(-x[idx3])
return out
def f(x, A, b):
"""Logistic loss, numerically stable implementation.
Parameters
----------
x: array-like, shape (n_features,)
Coefficients
A: array-like, shape (n_samples, n_features)
Data matrix
b: array-like, shape (n_samples,)
Labels
Returns
-------
loss: float
"""
z = np.dot(A, x)
b = np.asarray(b)
return np.mean((1 - b) * z - logsig(z))
</code></pre>
<p>
In the same example as before, this implementation does not overflow anymore:
</p>
<pre><code class="python">>>> A = [[1, 1]]
>>> b = [1]
>>> f([20, 20], A, b)
0.0</code></pre>
<h2>Gradient</h2>
<p>
For optimization, having an accurate evaluation of the gradient is crucial to achieve a solution with high accuracy. In this section I'll compare different approaches to compute this gradient.
</p>
<p>
Computation of the gradient of the logistic loss simplifies considerably thanks to the following identity verified by the sigmoid function:
\begin{equation}\label{eq:diffeq_sigma}
\frac{d}{dt}s(t) = s(t)(1 - s(t))~.
\end{equation}
In particular, this implies that
\begin{equation}
\label{eq:dif_log}
\frac{d}{dt}\log(s(t)) = \frac{\frac{d}{dt}s(t)}{s(t)} = 1 - s(t)
\end{equation}
</p>
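The first identity is easy to verify numerically with a central finite difference (illustrative check, using SciPy's expit as the sigmoid):

```python
import numpy as np
from scipy import special

t = np.linspace(-5, 5, 11)
h = 1e-6
# central finite difference of the sigmoid s ...
numeric = (special.expit(t + h) - special.expit(t - h)) / (2 * h)
# ... matches the right-hand side s(t) * (1 - s(t))
analytic = special.expit(t) * (1 - special.expit(t))
print(np.max(np.abs(numeric - analytic)))  # tiny
```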
<p>
Using these identities and \eqref{eq:logistic2}, we can write the gradient of the logistic loss as
\begin{equation}
\label{eq:gradient}
\nabla f(\xx) =~ \frac{1}{n}\sum_{i=1}^n\aa_i (s(\aa_i^\intercal \xx) - b_i)
\end{equation}
</p>
<p>
As we will see, there are also some numerical issues with the computation of $s(\cdot) - b_i$ that can be problematic when computing this expression. We will now compare three different implementations of $g(t) \defas s(t) - b$:
</p>
<p>
1. <b>naive</b>. The previous equation is easy to implement using SciPy's sigmoid function (<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.expit.html">special.expit</a>):
</p>
<pre><code class="python">from scipy import special
def g_naive(t, b):
return special.expit(t) - b
</code></pre>
<p>Unfortunately, the subtraction of $b$ can cause a catastrophic loss of precision when $s(t)$ is close to $b$, as happens for $b = 1$ and large $t$:
</p>
<pre><code class="python">>>> g_naive(40, 1)
0.0
</code></pre>
<p>
while the correct answer is $\approx -4.24 \times 10^{-18}$.
</p>
<p>
2. <b>g_v2</b>. We can avoid the precision loss of the previous example by reformulating the problematic expression as a single fraction, multiplying numerator and denominator by $e^{t}$:
\begin{equation}
s(t) - b = \frac{1}{1 + e^{-t}} - b = \frac{(1 - b)e^{t} - b}{1 + e^{t}}\label{eq:expit_b}
\end{equation}
In this case, when $b=1$ the expression becomes $-1 / (1 + e^{t})$, which no longer subtracts two nearly equal quantities.
</p>
<pre><code class="python">import numpy as np
def g_v2(t, b):
    exp_t = np.exp(t)
    return ((1 - b) * exp_t - b) / (1 + exp_t)
</code></pre>
<p>
Which as advertised gives a more accurate answer in the previous case:
</p>
<pre><code class="python">>>> g_v2(40, 1)
-4.248354255291589e-18
</code></pre>
<p>While better than the naive approach, this routine suffers from other issues. For example, the exponential overflows for large values of $t$ (starting around 710 in double precision), with this routine returning NaN:</p>
<pre><code class="python">>>> g_v2(800, 1)
RuntimeWarning: overflow encountered in exp
RuntimeWarning: invalid value encountered in double_scalars
nan
</code></pre>
<p>
3. <b>g_sign</b>. We can fix this last problem by avoiding the evaluation of the exponential at large positive arguments. For this, we multiply the numerator and denominator of \eqref{eq:expit_b} by $e^{-t}$ to arrive at the equivalent expression $((1 - b) - b e^{-t}) / (1 + e^{-t})$, which does not contain any $e^{t}$ term. We then choose this formula whenever $t > 0$ and the previous one otherwise, so that the exponential is always evaluated at a non-positive argument. This gives the following code:
</p>
<pre><code class="python">import numpy as np
def g_sign(t, b):
    if t < 0:
        # evaluate with exp(t), which cannot overflow for t < 0
        exp_t = np.exp(t)
        return ((1 - b) * exp_t - b) / (1 + exp_t)
    else:
        # evaluate with exp(-t), which cannot overflow for t >= 0
        exp_nt = np.exp(-t)
        return ((1 - b) - b * exp_nt) / (1 + exp_nt)
</code></pre>
<p>
And we can check that this no longer overflows:
</p>
<pre><code class="python">>>> g_sign(800, 1)
0.0
</code></pre>
<h3>Comparison</h3>
<p>
We now compare the relative accuracy of these approaches. In the plot below, we show the relative accuracy (lower is better) as a function of the input.
</p>
<figure class="fullwidth">
<img src="/images/2019/precision_grad_logloss.svg" alt="">
</figure>
<p>
<span class="marginnote">
<b>Relative precision for different implementations of the logistic loss's gradient</b> (lower is better). The naive method quickly suffers from loss of relative precision in the positive segment. expit_b exhibits better accuracy but outputs NaN for large values of the input (values above 1 indicate NaN). expit_sign has none of these issues and has the best overall accuracy.
<br />
<a href="https://colab.research.google.com/gist/fabianp/2e3fbd3cc9046a87de59fff1603bbe5b/logistic_regression.ipynb">
<img src="/images/2019/colab-badge.svg" alt="Open In Colab" /></a>
</span>
</p>
<p>
This plot shows that the sign version has an accuracy at least as good as that of the v2 function, without its indeterminacy for large values of the input.
</p>
<p>
<b>TL;DR:</b> use g_sign.
</p>
<h2>A more stable gradient.</h2>
<p>
We can now use the more stable routines developed in the last section to compute the full gradient. As before, the auxiliary routine (expit_b in this case) is slightly more convoluted than the ones presented before because it applies the previous case distinction component-wise, without for loops, for efficiency on large arrays:<dt-note>A full implementation of the logistic loss and its gradient, with some bells and whistles like support for sparse matrices (and incidentally, the main motivation for writing this post), can be found in the optimization package <a href="http://openopt.github.io/copt/generated/copt.utils.LogLoss.html#copt.utils.LogLoss">copt</a>.
</dt-note>
</p>
<pre><code class="python">import numpy as np
def expit_b(x, b):
"""Compute sigmoid(x) - b component-wise."""
idx = x < 0
out = np.zeros_like(x)
exp_x = np.exp(x[idx])
b_idx = b[idx]
out[idx] = ((1 - b_idx) * exp_x - b_idx) / (1 + exp_x)
exp_nx = np.exp(-x[~idx])
b_nidx = b[~idx]
out[~idx] = ((1 - b_nidx) - b_nidx * exp_nx) / (1 + exp_nx)
return out
def f_grad(x, A, b):
"""Computes the gradient of the logistic loss.
Parameters
----------
x: array-like, shape (n_features,)
Coefficients
A: array-like, shape (n_samples, n_features)
Data matrix
b: array-like, shape (n_samples,)
Labels
Returns
-------
grad: array-like, shape (n_features,)
"""
z = A.dot(x)
s = expit_b(z, b)
return A.T.dot(s) / A.shape[0]
</code></pre>
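As a quick sanity check (illustrative, not part of the original post), we can compare this gradient against a finite-difference approximation of the naive loss at moderate inputs, where the naive implementation is still accurate:

```python
import numpy as np
from scipy import special, optimize

def expit_b(x, b):
    """Compute sigmoid(x) - b component-wise (stable version from above)."""
    idx = x < 0
    out = np.zeros_like(x)
    exp_x = np.exp(x[idx])
    out[idx] = ((1 - b[idx]) * exp_x - b[idx]) / (1 + exp_x)
    exp_nx = np.exp(-x[~idx])
    out[~idx] = ((1 - b[~idx]) - b[~idx] * exp_nx) / (1 + exp_nx)
    return out

def f_grad(x, A, b):
    """Gradient of the logistic loss, using the stable expit_b routine."""
    return A.T.dot(expit_b(A.dot(x), b)) / A.shape[0]

def f_naive(x, A, b):
    """Naive logistic loss, accurate enough at moderate inputs."""
    tmp = special.expit(np.dot(A, x))
    return np.mean(-b * np.log(tmp) - (1 - b) * np.log(1 - tmp))

rng = np.random.RandomState(0)
A = rng.randn(20, 5)
b = (rng.rand(20) > 0.5).astype(float)
x = 0.1 * rng.randn(5)
numeric = optimize.approx_fprime(x, f_naive, 1e-7, A, b)
print(np.max(np.abs(numeric - f_grad(x, A, b))))  # small
```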
<h2>Conclusion</h2>
<p>
A naive coding of the logistic loss and its gradient suffers numerical issues that go from indeterminacy to loss of precision. In this post I compared different approaches that can be used to mitigate this problem.
Machine learning software typically implements some of these approaches, as obtaining a single NaN value during training can be fatal. For example, both <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py">scikit-learn</a> and <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/logistic-loss.h">tensorflow</a> make a distinction of cases on the sign for the gradient, as the g_sign method above.
</p>
<p>
<b>A word about speed.</b> The speed difference between the different approaches is negligible for high-dimensional problems, since the cost of evaluating the logistic function or its gradient is small compared to that of the dot products $\aa_i^\intercal \xx$. For example, on a 2,000-dimensional square problem, the timings of f and f_naive are almost the same:
</p>
<figure>
<pre><code class="python">>>> A = np.random.randn(2000, 2000)
>>> b = (np.sign(np.random.randn(2000)) + 1) // 2
>>> x = np.random.randn(2000)
>>> %timeit f(x, A, b)
1.74 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit f_naive(x, A, b)
1.97 ms ± 338 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
</code></pre>
</figure>
<h3>Thanks</h3>
<p>
to <a href="https://scholar.google.ca/citations?user=xdlBKc8AAAAJ&hl=fr">Pierre-Antoine Manzagol</a> for many comments on this blog post.
</p>
<p>
<b>Edit</b> (January 2021). A few months after I wrote this blog post, the following excellent survey on the related softmax function was published. <dt-cite key="10.1093/imanum/draa038"></dt-cite>
<br><br>
</p>
<hr />
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
Notes on the Frank-Wolfe Algorithm, Part II: A Primal-dual Analysis2018-11-17T00:00:00+01:002018-11-17T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2018-11-17:/blog/2018/fw2/
<p>This blog post extends the convergence theory from the <a href="/blog/2018/notes-on-the-frank-wolfe-algorithm-part-i/">first part of these notes</a> on the
Frank-Wolfe (FW) algorithm with convergence guarantees on the primal-dual gap which generalize
and strengthen the convergence guarantees obtained in the first part. </p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax …</script>
<p>This blog post extends the convergence theory from the <a href="/blog/2018/notes-on-the-frank-wolfe-algorithm-part-i/">first part of these notes</a> on the
Frank-Wolfe (FW) algorithm with convergence guarantees on the primal-dual gap which generalize
and strengthen the convergence guarantees obtained in the first part. </p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@inproceedings{jaggi2013revisiting,
title={Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization.},
author={Jaggi, Martin},
journal={Proceedings of the 30th International Conference on Machine Learning},
year={2013},
url={http://proceedings.mlr.press/v28/jaggi13-supp.pdf}
}
@article{bach2015duality,
title={Duality between subgradient and conditional gradient methods},
author={Bach, Francis},
journal={SIAM Journal on Optimization},
year={2015},
publisher={SIAM},
url={https://arxiv.org/pdf/1211.6302.pdf}
}
@article{clarkson2010coresets,
title={Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm},
author={Clarkson, Kenneth L},
journal={ACM Transactions on Algorithms (TALG)},
year={2010},
publisher={ACM},
url={http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.145.9299&rep=rep1&type=pdf}
}
@inproceedings{lacoste2013block,
title={Block-Coordinate Frank-Wolfe Optimization for Structural SVMs},
author={Lacoste-Julien, Simon and Jaggi, Martin and Schmidt, Mark and Pletscher, Patrick},
journal={International Conference on Machine Learning},
year={2013}
}
@inproceedings{bauschke2017convex,
title={Convex Analysis and Monotone Operator Theory in Hilbert Spaces},
author={Bauschke, Heinz and Combettes, Patrick},
journal={CMS Books in Mathematics},
year={2013},
url={https://doi.org/10.1007/978-3-319-48311-5}
}
@inproceedings{frank1956algorithm,
author = {Frank, Marguerite and Wolfe, Philip},
title = {An algorithm for quadratic programming},
journal = {Naval Research Logistics Quarterly},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109},
year={1956}
}
@inproceedings{dem1967minimization,
title={The minimization of a smooth convex functional on a convex set},
author={Demyanov, Vladimir and Rubinov, Alexander},
journal={SIAM Journal on Control},
year={1967},
url={https://doi.org/10.1137/0305019}
}
@inproceedings{nesterov2018complexity,
title={Complexity bounds for primal-dual methods minimizing the model of objective function},
author={Nesterov, Yu},
journal={Mathematical Programming},
year={2018},
url={https://doi.org/10.1007/s10107-017-1188-6}
}
@article{pedregosa2018step,
title={Step-Size Adaptivity in Projection-Free Optimization},
author={Pedregosa, Fabian and Askari, Armin and Negiar, Geoffrey and Jaggi, Martin},
journal={arXiv:1806.05123},
year={2018},
url={https://arxiv.org/pdf/1806.05123.pdf}
}
@article{bach2015duality,
title={Duality between subgradient and conditional gradient methods},
author={Bach, Francis},
journal={SIAM Journal on Optimization},
year={2015},
url={https://doi.org/10.1137/130941961}
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
document.addEventListener('DOMContentLoaded', doTOC, false);
</script>
<div id="TOC"></div>
<div style="display: none">
$$
\def\aa{\boldsymbol a}
\def\balpha{\boldsymbol \alpha}
\def\bmu{\boldsymbol \mu}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\dd{\boldsymbol d}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\vv{\boldsymbol v}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\TT{\boldsymbol T}
\def\Econd{\boldsymbol E}
\def\CC{\boldsymbol C}
\def\AA{\boldsymbol A}
\def\RR{\mathbb R}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\diam}{\mathbf{diam}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\def\defas{\stackrel{\text{def}}{=}}
$$
</div>
<h2 id="intro">Introduction</h2>
<p>Although FW is one of the oldest methods in nonlinear constrained optimization,<dt-cite key="frank1956algorithm"></dt-cite>
<dt-cite key="dem1967minimization"></dt-cite> the development of primal-dual guarantees is
relatively recent. In 2010, Clarkson<dt-cite key="clarkson2010coresets"></dt-cite> gave
primal-dual convergence rates in the particular case in which the domain is the simplex. These
results were later extended by Jaggi to arbitrary domains.<dt-cite key="jaggi2013revisiting">
</dt-cite>
</p>
<p>These results however gave convergence guarantees for the minimum over all iterates, and not on
the last one, as is usual in the case of <a href="/blog/2018/notes-on-the-frank-wolfe-algorithm-part-i/#theorem2">primal
suboptimality</a>. It was not until 2017, with the work of Nesterov,<dt-cite key="nesterov2018complexity"></dt-cite> that primal-dual convergence rates on the last iterate
were obtained. These primal-dual guarantees were recently extended by myself and coauthors to
other FW variants such as Away-steps FW, Pairwise FW and FW with backtracking line search.<dt-cite key="pedregosa2018step"></dt-cite>
</p>
<p>In this post I will only explore primal-dual convergence guarantees. However, duality theory
also reveals other connections between FW and other apparently disparate methods. For example,
in the paper "Duality between subgradient and conditional gradient methods",<dt-cite key="bach2015duality"></dt-cite> Bach showed an equivalence between the FW algorithm and
mirror descent on a dual problem.</p>
<h2 id="problem">Problem and Notation</h2>
<p>As in the first part of these notes, I will be discussing the FW method applied to the
following general optimization problem</p>
<p style="background-color: #D2E4FC; padding: 1px; border-radius: 8px;">
\begin{equation}\label{eq:fw_objective}
\argmin_{\xx \in \RR^d}\, \left\{\mathcal{P}(\xx) \defas f(\boldsymbol{x}) +
\imath_{\mathcal{C}}(\xx)\right\}~,
\end{equation}
</p>
<p>where $f$ is differentiable with $L$-Lipschitz gradient and $\imath_{\mathcal{C}}$ is the
indicator function over a convex and compact domain $\mathcal{C}$, that is, the function that
returns $0$ if the argument belongs to $\mathcal{C}$ and $+\infty$ otherwise.<dt-note>
\eqref{eq:fw_objective} is equivalent to the constrained problem $\argmin_{\xx \in
\mathcal{C}} f(\xx)$ that I used in the <a href="/blog/2018/notes-on-the-frank-wolfe-algorithm-part-i/">first part</a> of these notes,
although for our purposes the formulation with the indicator function will be more convenient.
</dt-note> In this post we will also assume that $f$ is convex.
</p>
<p>From an initial guess $\xx_0$, the FW algorithm generates a sequence of iterates $\xx_1, \xx_2,
\ldots$ that converge towards the solution to \eqref{eq:fw_objective}. Below is the pseudo-code
for this algorithm, as presented in the <a href="/blog/2018/notes-on-the-frank-wolfe-algorithm-part-i/">first part</a> of these notes:
</p>
<p class="framed">\begin{align}
&\textbf{Input}: \text{initial guess $\xx_0$, tolerance $\delta > 0$}\nonumber\\
& \textbf{For }t=0, 1, \ldots \textbf{ do } \\
&\quad \balpha_t = \nabla f(\boldsymbol{x}_t)\\
&\quad\boldsymbol{s}_t \in {\textstyle\argmax_{\boldsymbol{s} \in \mathcal{C}}} \langle -
\balpha_t, \boldsymbol{s}\rangle\label{eq:lmo}\\
&\quad \boldsymbol{d}_t = \ss_t - \xx_t\\
&\quad g_t = -\langle \balpha_t, \dd_t \rangle\\
&\quad \textbf{If } g_t < \delta: \\ &\quad\qquad\hfill\text{// exit if gap is below
tolerance }\nonumber\\ &\quad\qquad\textbf{return } \xx_t\\ &\quad {\textbf{Variant
1}}: \text{set step size as} \nonumber\\
&\quad\qquad\gamma_t=\vphantom{\sum_i}\min\Big\{\frac{g_t}{L\|\dd_t\|^2}, 1
\Big\}\label{eq:step_size}\\ &\quad \textbf{Variant 2}: \text{set step size by line
search}\nonumber\\ &\quad\qquad\gamma_t=\argmin_{\gamma \in [0, 1]} f(\xx_t + \gamma
\boldsymbol{d}_t)\label{eq:line_search}\\ &\quad\boldsymbol{x}_{t+1}=\boldsymbol{x}_t +
\gamma_t \boldsymbol{d}_t~.\label{eq:update_rule}\\ &\textbf{end For loop}\\ &
\textbf{return } \xx_t \end{align} <br>
</p>
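<p>To make the pseudo-code concrete, below is a minimal Python sketch of Variant 1 (the function names and the simplex example are my own, not part of the original pseudo-code):</p>

```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, L, tol=1e-6, max_iter=10000):
    """Frank-Wolfe with the Variant 1 step size gamma_t = min(g_t / (L ||d_t||^2), 1)."""
    x = x0.astype(float)
    for _ in range(max_iter):
        alpha = grad_f(x)                  # alpha_t = grad f(x_t)
        s = lmo(-alpha)                    # s_t: solution of the linear subproblem over C
        d = s - x                          # d_t
        g = -alpha @ d                     # Frank-Wolfe gap g_t
        if g < tol:                        # exit if gap is below tolerance
            break
        gamma = min(g / (L * (d @ d)), 1.0)
        x = x + gamma * d
    return x, g

# Toy example: minimize ||A x - b||^2 over the probability simplex,
# whose LMO returns the coordinate vector maximizing the linear function.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5)) / 5
b = rng.standard_normal(20)
grad_f = lambda x: 2 * A.T @ (A @ x - b)
L = 2 * np.linalg.norm(A.T @ A, 2)         # Lipschitz constant of the gradient
lmo = lambda v: np.eye(5)[np.argmax(v)]
x, gap = frank_wolfe(grad_f, lmo, np.ones(5) / 5, L)
```

<p>Note that the iterates stay feasible by construction: each update is a convex combination of the current iterate and a vertex returned by the linear oracle.</p>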
<h2 id="duality">Duality and Frank-Wolfe</h2>
<p>In convex optimization, every problem can be paired with another, said to be <i>dual</i> to
it. Through this pairing, close connections between otherwise disparate problems and methods arise.
The FW algorithm has deep and unexpected links with duality, which we will now explore.</p>
<p>The <i>dual problem</i> is a concave maximization problem that we can associate with our
original convex minimization problem. Then by definition of convex conjugate<dt-note key="conjugate"><b>Definition (convex conjugate).</b> Let $f: \RR^d \to [-\infty, \infty]$
be an extended real-valued function. Its convex conjugate, denoted $f^*$, is the function
$f^*: \RR^d \to [-\infty, \infty]$ defined by
$$
f^{{*}}(\yy) \defas \sup_{\xx \in \RR^d} \left \{ \langle \yy , \xx \rangle - f \left( \xx
\right) \right\}\,,\qquad \yy \in \RR^d\,.
$$</dt-note> we have the following sequence of equivalences, starting from
the original (primal) problem:
\begin{align}
&\min_{\xx \in \RR^d} \mathcal{P}(\xx) = \min_{\xx \in \RR^d} f(\xx) +
\imath_{\mathcal{C}}(\xx)\\
&= \min_{\xx \in \RR^d} \max_{\balpha \in \RR^d}\Big\{\langle \xx, \balpha\rangle -
f^*(\balpha)\Big\} + \imath_{\mathcal{C}}(\xx)\\
&= \max_{\balpha \in \RR^d}\min_{\xx \in \RR^d}\Big\{ \imath_{\mathcal{C}}(\xx) +
\langle \xx,\balpha\rangle\Big\} - f^*(\balpha)\\
&= \max_{\balpha \in \RR^d}\underbrace{-\imath_{\mathcal{C}}^*(-\balpha) -
f^*(\balpha)}_{\defas \mathcal{D}(\balpha)}~,\label{eq:pd_relationship}
\end{align}
where in the second identity we have used <a href="https://en.wikipedia.org/wiki/Sion%27s_minimax_theorem">Sion's minimax theorem</a>
to exchange the $\max$ and $\min$. Note also that we can use $\max$ instead of the usual
$\sup$ in the definition of Fenchel conjugate because of the smoothness of $f$, which
implies strong convexity of $f^*$.
We will call \eqref{eq:pd_relationship}, which is a concave maximization problem, the
<i>dual problem</i>. By extension, we will refer to $\mathcal{D}$ as the <i>dual
objective</i> or
dual function.
</p>
<p>The dual objective lower bounds the primal objective and so the difference between primal
and dual objectives gives a bound on the suboptimality. In other words, let $\xx^\star$ be
any solution to the original primal problem \eqref{eq:fw_objective}. Then by definition of
dual objective we have
\begin{equation}
f(\xx) - f(\xx^\star) \leq f(\xx) - \mathcal{D}(\balpha)~,
\end{equation}
for any $\xx \in \mathcal{C}$ and any $\balpha \in \dom(\mathcal{D})$.
This is quite remarkable, as it gives a meaningful bound on the distance to optimum without needing
to know $\xx^\star$, which is of course unknown. This can then be used, for instance, as a stopping
criterion in optimization algorithms. The quantity $f(\xx) - \mathcal{D}(\balpha)$ is often
referred to as the <i>primal-dual</i> gap.
</p>
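<p>As a small numerical illustration of weak duality (a toy instance of my own, not from the analysis above): take $f(\xx) = \frac{1}{2}\|\xx - \boldsymbol{c}\|^2$ and $\mathcal{C}$ an $\ell_2$ ball of radius $r$, for which $f^*$ and the support function $\imath_{\mathcal{C}}^*$ have hand-derived closed forms:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
dim, r = 4, 2.0
c = rng.standard_normal(dim)

# Primal objective P(x) = f(x) for feasible x, with f(x) = 0.5 ||x - c||^2.
P = lambda x: 0.5 * np.sum((x - c) ** 2)
# Hand-derived conjugates for this instance:
#   f*(y) = <y, c> + 0.5 ||y||^2   and   i_C^*(y) = r ||y||  (support function of the ball)
f_conj = lambda y: y @ c + 0.5 * y @ y
D = lambda a: -r * np.linalg.norm(a) - f_conj(a)   # D(a) = -i_C^*(-a) - f*(a)

# The dual objective lower-bounds the primal at every feasible point:
for _ in range(100):
    x = rng.standard_normal(dim)
    x *= r / max(np.linalg.norm(x), r)             # scale into the ball if needed
    a = rng.standard_normal(dim)                   # arbitrary dual point
    assert D(a) <= P(x) + 1e-12
```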
<p>However, computing the primal-dual gap involves evaluating the dual objective, which in the
general case can be as costly as solving the original problem. What is special in the case
of FW is that the primal-dual gap is given as a byproduct of the method. This is quite
unique among optimization methods. </p>
<p>The next lemma shows that the primal-dual gap (for a specific choice of primal and dual
variables) is exactly equal to the FW gap $g_t$. </p>
<p class="framed" id="lemma2"> <b>Lemma 2</b>. The FW gap $g_t$ is a duality gap:
\begin{equation}
g_t = \mathcal{P}(\xx_t) - \mathcal{D}(\nabla f(\xx_t))~.
\end{equation}
</p>
<div class="proof">
<p>We have the following sequence of identities
\begin{align}
g_t &= \max_{\ss \in \mathcal{C}}\langle \nabla f(\xx_t), \xx_t - \ss\rangle\\
& = \langle \nabla f(\xx_t), \xx_t\rangle + {\max_{\ss \in \RR^d}}\Big\{ \langle-
\nabla f(\xx_t), \ss\rangle - \imath_{\mathcal{C}}(\ss)\Big\}\\
& = \langle \nabla f(\xx_t), \xx_t\rangle + \imath^*_{\mathcal{C}}(-\nabla f(\xx_t))\\
& = f(\xx_t) + \underbrace{\imath^*_{\mathcal{C}}(-\nabla f(\xx_t)) +
f^*(\nabla f(\xx_t))}_{=-\mathcal{D}(\nabla f(\xx_t))}~,
\end{align}
where the first identity uses the definition of $\ss_t$, the second one the definition of
convex conjugate and the last one is a consequence of the Fenchel-Young identity (see for
example Proposition 16.10 of Bauschke and Combettes).<dt-cite key="bauschke2017convex">
</dt-cite>
</p>
</div>
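<p>Lemma 2 is also easy to check numerically. On the same kind of hand-built instance (a quadratic over an $\ell_2$ ball, with closed-form conjugates that are my own derivation), the FW gap matches $\mathcal{P}(\xx) - \mathcal{D}(\nabla f(\xx))$ to machine precision:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
dim, r = 4, 2.0
c = rng.standard_normal(dim)

f = lambda x: 0.5 * np.sum((x - c) ** 2)           # P(x) = f(x) on C
grad = lambda x: x - c
f_conj = lambda y: y @ c + 0.5 * y @ y             # f*(y), derived by hand for this f
D = lambda a: -r * np.linalg.norm(a) - f_conj(a)   # uses i_C^*(y) = r ||y||

x = rng.standard_normal(dim)
x *= r / max(np.linalg.norm(x), r)                 # a feasible point

alpha = grad(x)
# FW gap over the ball: g = max_{s in C} <alpha, x - s> = <alpha, x> + r ||alpha||
g = alpha @ x + r * np.linalg.norm(alpha)
assert np.isclose(g, f(x) - D(alpha))              # Lemma 2
```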
<h2>Duality Gap Convergence Rate</h2>
<p>The next theorem gives an $\mathcal{O}(1/t)$ convergence rate for FW on the duality gap and
is the main result in this post.<dt-note>A similar but more general theorem, which supports
inexact oracles and other step sizes, is given in <a href="https://arxiv.org/pdf/1806.05123.pdf">Theorem 3 of this paper</a>.</dt-note>
Note that since the dual gap $\mathcal{P}(\xx_t) - \mathcal{D}(\nabla f(\xx_t) )$ is an
upper bound on the primal
suboptimality $f(\xx_t) - f(\xx^\star)$, this is strictly stronger and implies the
convergence rates we got in <a href="/blog/2018/notes-on-the-frank-wolfe-algorithm-part-i/#theorem2">Theorem 2 of the first
part of these notes</a>.</p>
<p class="framed"><b>Theorem 3</b> . Let $\bmu$ be defined recursively as $\bmu_0 = \nabla
f(\xx_0)$, $\bmu_{t+1} = (1 - \xi_t)\bmu_t + \xi_t\nabla f(\xx_t)$, with $\xi_t = 2 /
(t + 2)$. Then we have:
\begin{equation}
\mathcal{P}(\xx_t) - \mathcal{D}(\bmu_t) \leq \frac{2L\diam(\mathcal{C})^2}{t + 1} =
\mathcal{O}\left(\frac{1}{t}\right)~.
\end{equation}
</p>
<div class="proof">
<p>By the <a href="/blog/2018/notes-on-the-frank-wolfe-algorithm-part-i/#key-recursive">key
recursive inequality</a> we have the following inequality, valid for all $\xi_t \in [0,
1]$:
\begin{align}
\mathcal{P}(\xx_{t+1}) &\leq \mathcal{P}(\xx_t) - \xi_t g_t +
\frac{\xi_t^2 L}{2}\|\ss_t - \xx_t\|^2\\
&= (1 - \xi_t)\mathcal{P}(\xx_t) + \xi_t \mathcal{D}(\balpha_t) + \frac{\xi_t^2
L}{2}\|\ss_t -
\xx_t\|^2~,
\end{align}
where the last identity follows by <a href="#lemma2">Lemma 2</a>.
</p>
<p>
For the dual objective, concavity of $\mathcal{D}$ (Jensen's inequality) gives $\mathcal{D}(\bmu_{t+1}) \geq (1 -
\xi_t)\mathcal{D}(\bmu_t) + \xi_t \mathcal{D}(\balpha_t)$. Combining this with the previous
inequality we obtain
\begin{equation}
\mathcal{P}(\xx_{t+1}) - \mathcal{D}(\bmu_{t+1}) \leq (1 - \xi_t)(\mathcal{P}(\xx_t) -
\mathcal{D}(\bmu_t)) + \frac{\xi_t^2
L}{2}\|\ss_t -
\xx_t\|^2~
\end{equation}
</p>
<p>
Let $\xi_t = 2 / (t+2)$ and $e_t \defas \frac{t(t+1)}{2}(\mathcal{P}(\xx_t) - \mathcal{D}(\bmu_t))$.
Multiplying the previous inequality by $(t+1)(t+2)/2$ then gives
\begin{align}
e_{t+1} &\leq e_t + \frac{(t+1)}{(t+2)}L\|\ss_t - \xx_t\|^2\\
&\leq e_t + L\diam(\mathcal{C})^2~.
\end{align}
Summing this last inequality from $0$ to $t-1$ and using $e_0 = 0$ gives
\begin{equation}\label{eq:sublinear_bound_sigma}
e_{t} \leq t L \diam(\mathcal{C})^2 \implies \mathcal{P}(\xx_t) - \mathcal{D}(\bmu_t) \leq \frac{2 L
\diam(\mathcal{C})^2}{t+1}~.
\end{equation}
</p>
</div>
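<p>As a sanity check (a numerical experiment of my own, not part of the proof), the bound of Theorem 3 can be verified on the same kind of instance with closed-form conjugates, running FW with the Variant 1 step size and tracking the averaged gradients $\bmu_t$:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
dim, r, L, T = 5, 1.0, 1.0, 200
c = rng.standard_normal(dim)
c *= 0.5 / np.linalg.norm(c)                # put the unconstrained optimum inside the ball

f = lambda x: 0.5 * np.sum((x - c) ** 2)    # L-smooth with L = 1
grad = lambda x: x - c
f_conj = lambda y: y @ c + 0.5 * y @ y      # f*, derived by hand for this f
D = lambda a: -r * np.linalg.norm(a) - f_conj(a)
lmo = lambda v: r * v / max(np.linalg.norm(v), 1e-16)   # argmax over the l2 ball

x = np.zeros(dim); x[0] = r                 # feasible start
mu = grad(x)                                # mu_0 = grad f(x_0)
for t in range(T):
    alpha = grad(x)
    # Theorem 3: P(x_t) - D(mu_t) <= 2 L diam(C)^2 / (t + 1), with diam(C) = 2 r
    assert f(x) - D(mu) <= 2 * L * (2 * r) ** 2 / (t + 1) + 1e-9
    s = lmo(-alpha)
    d = s - x
    g = -alpha @ d
    gamma = min(g / (L * (d @ d)), 1.0)     # Variant 1 step size
    mu = (1 - 2.0 / (t + 2)) * mu + 2.0 / (t + 2) * alpha   # mu_{t+1}
    x = x + gamma * d
```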
<h2>Citing</h2>
<p>
If this post has been useful to you, please consider citing <a href="https://arxiv.org/pdf/1806.05123.pdf">its associated paper</a> that contains the above analysis and much more, including backtracking line-search and other Frank-Wolfe
variants:
<pre>
@inproceedings{pedregosa2020linearly,
title={Linearly Convergent Frank-Wolfe with Backtracking Line-Search},
author={Pedregosa, Fabian and Negiar, Geoffrey and Askari, Armin and Jaggi, Martin},
booktitle={Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics},
year={2020},
url={https://arxiv.org/pdf/1806.05123.pdf}
}
</pre>
</p>
<hr />
<h2>References</h2>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
<hr>
<figure class="fullwidth">
<p style="width: 100%"><b>Thanks.</b> A heartfelt thanks to <a href="http://nicolas.le-roux.name/">Nicolas Le Roux</a>, <a href="http://vene.ro/">Vlad
Niculae</a> and <a href="https://konstmish.github.io/">Konstantin Mishchenko</a>
for reporting typos on this post.</p>
</figure>
<hr />
Three Operator Splitting (2018-09-06, Fabian Pedregosa): I discuss a recently proposed optimization algorithm: the Davis-Yin three operator splitting.
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<script type="text/javascript" src="/theme/js/bibtexParse.js"></script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = `
@article{davis2017three,
title={A three-operator splitting scheme and its optimization applications},
author={Davis, Damek and Yin, Wotao},
journal={Set-valued and variational analysis},
year={2017},
url={https://doi.org/10.1007/s11228-017-0421-z}}
@article{davis2015three,
title={A Three-Operator Splitting Scheme and its Optimization Applications},
author={Davis, Damek and Yin, Wotao},
journal={arXiv preprint arXiv:1504.01032},
year={2015},
url={https://arxiv.org/abs/1504.01032}
}
@article{condat2013primal,
title={A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms},
author={Condat, Laurent},
journal={Journal of Optimization Theory and Applications},
year={2013},
url={https://doi.org/10.1007/s10957-012-0245-9},
openaccess={https://hal.archives-ouvertes.fr/hal-00609728v5}
}
@article{pedregosa18a,
title = {Adaptive Three Operator Splitting},
author = {Pedregosa, Fabian and Gidel, Gauthier},
journal = {Proceedings of the 35th International Conference on Machine Learning},
year = {2018},
pdf = {http://proceedings.mlr.press/v80/pedregosa18a/pedregosa18a.pdf},
url = {http://proceedings.mlr.press/v80/pedregosa18a.html},
}
@book{asmussen2007stochastic,
title={Stochastic simulation: algorithms and analysis},
author={Asmussen, Søren and Glynn, Peter W},
volume={57},
year={2007},
journal={Springer Science \& Business Media},
url={http://dx.doi.org/10.1007/978-0-387-69033-9},
}
@article{mclachlan2002splitting,
title={Splitting methods},
author={McLachlan, Robert I and Quispel, G Reinout W},
journal={Acta Numerica},
volume={11},
year={2002},
url={https://doi.org/10.1017/S0962492902000053}
}
@incollection{macnamara2016operator,
title={Operator splitting},
author={MacNamara, Shev and Strang, Gilbert},
journal={Splitting Methods in Communication, Imaging, Science, and Engineering},
year={2016},
publisher={Springer},
url={https://doi.org/10.1007/978-3-319-41589-5_3}
}
@article{boyd2011distributed,
title={Distributed optimization and statistical learning via the alternating direction method of multipliers},
author={Boyd, Stephen and Parikh, Neal and Chu, Eric and Peleato, Borja and Eckstein, Jonathan and others},
journal={Foundations and Trends in Machine learning},
year={2011},
url={http://dx.doi.org/10.1561/2200000016}
}
@incollection{combettes2011proximal,
title={Proximal splitting methods in signal processing},
author={Combettes, Patrick L and Pesquet, Jean-Christophe},
journal={Fixed-point algorithms for inverse problems in science and engineering},
year={2011},
publisher={Springer},
url={https://doi.org/10.1007/978-1-4419-9569-8_10},
openaccess={https://arxiv.org/pdf/0912.3522.pdf}
}
@article{glowinski1975approximation,
title={Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires},
author={Glowinski, Roland and Marroco, A},
journal={Revue français d'automatique, informatique, recherche opérationnelle},
year={1975},
publisher={EDP Sciences},
url={http://www.numdam.org/article/M2AN_1975__9_2_41_0.pdf}
}
@article{gabay1976dual,
title={\href{https://doi.org/10.1016/0898-1221(76)90003-1}{A dual algorithm for the solution of nonlinear variational problems via finite element approximation}},
author={Gabay, Daniel and Mercier, Bertrand},
journal={Computers \& Mathematics with Applications},
year={1976},
publisher={Elsevier}
}
@Article{raguet2018,
author={Raguet, Hugo},
title={A note on the forward-Douglas--Rachford splitting for monotone inclusion and convex optimization},
journal={Optimization Letters},
year={2018},
url={https://doi.org/10.1007/s11590-018-1272-8},
openaccess={https://arxiv.org/pdf/1704.06948.pdf}
}
@article{beck2009gradient,
title={Gradient-based algorithms with applications to signal recovery},
author={Beck, Amir and Teboulle, Marc},
journal={Convex optimization in signal processing and communications},
year={2009},
url={https://ie.technion.ac.il/~becka/papers/gradient_chapter.pdf}
}
@article{Teboulle2018,
author="Teboulle, Marc",
title="A simplified view of first order methods for optimization",
journal="Mathematical Programming",
year="2018",
url="https://doi.org/10.1007/s10107-018-1284-2"
}
@article{cuppen1980divide,
title={A divide and conquer method for the symmetric tridiagonal eigenproblem},
author={Cuppen, Jan JM},
journal={Numerische Mathematik},
year={1980},
publisher={Springer}
}
@article{gidel17a,
title = {Frank-Wolfe Algorithms for Saddle Point Problems},
author = {Gauthier Gidel and Tony Jebara and Simon Lacoste-Julien},
journal = {Proceedings of the 20th International Conference on Artificial Intelligence and Statistics},
year = {2017},
pdf = {http://proceedings.mlr.press/v54/gidel17a/gidel17a.pdf},
url = {http://proceedings.mlr.press/v54/gidel17a.html},
}
@article{chambolle2016ergodic,
title={On the ergodic convergence rates of a first-order primal--dual algorithm},
author={Chambolle, Antonin and Pock, Thomas},
journal={Mathematical Programming},
year={2016},
publisher={Springer},
url={https://doi.org/10.1007/s10107-015-0957-3}
}
@article{malitsky2018first,
title={A first-order primal-dual algorithm with linesearch},
author={Malitsky, Yura and Pock, Thomas},
journal={SIAM Journal on Optimization},
year={2018},
publisher={SIAM},
url={https://doi.org/10.1137/16M1092015},
openaccess={https://arxiv.org/abs/1608.08883}
}
@book{hiriart2013convex,
title={Convex analysis and minimization algorithms I: Fundamentals},
author={Hiriart-Urruty, Jean-Baptiste and Lemaréchal, Claude},
year={1993},
journal={Springer science \& business media},
url={http://dx.doi.org/10.1007/978-3-662-02796-7}
}
@book{rockafellar1970convex,
title={Convex analysis},
author={Rockafellar, Ralph Tyrell},
year={1970},
journal={Princeton University Press}
}
@article{vu2013splitting,
title={A splitting algorithm for dual monotone inclusions involving cocoercive operators},
author={Vũ, Bằng Công},
journal={Advances in Computational Mathematics},
year={2013},
publisher={Springer},
url={https://doi.org/10.1007/s10444-011-9254-8}
}
@article{raguet2013generalized,
title={A generalized forward-backward splitting},
author={Raguet, Hugo and Fadili, Jalal and Peyré, Gabriel},
journal={SIAM Journal on Imaging Sciences},
year={2013},
publisher={SIAM},
url={https://doi.org/10.1137/120872802},
}
@article{briceno2015forward,
title={Forward-Douglas--Rachford splitting and forward-partial inverse method for solving monotone inclusions},
author={Briceño-Arias, Luis M},
journal={Optimization},
year={2015},
publisher={Taylor \& Francis},
url={https://arxiv.org/pdf/1212.5942.pdf},
}
@article{ryu2018operator,
title={Operator Splitting Performance Estimation: Tight contraction factors and optimal parameter selection},
author={Ryu, Ernest K and Taylor, Adrien B and Bergeling, Carolina and Giselsson, Pontus},
journal={arXiv preprint arXiv:1812.00146},
year={2018},
url={https://arxiv.org/abs/1812.00146}
}
`
document.addEventListener('DOMContentLoaded', doReferences, false);
</script>
<div style="display: none">
$$
\def\aa{\boldsymbol a}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\TT{\boldsymbol T}
\def\CC{\boldsymbol C}
\def\RR{\mathbb R}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\def\defas{\stackrel{\text{def}}{=}}
$$
</div>
<p>Splitting methods decompose a complicated problem into a sequence of simpler subproblems. An idea as fundamental as this one has found applications in many areas like linear algebra,<dt-cite key="cuppen1980divide"></dt-cite> integration of ordinary differential equations,<dt-cite key="mclachlan2002splitting"></dt-cite> <dt-cite key="macnamara2016operator"></dt-cite> or the Monte-Carlo estimation of expectations,<dt-cite key="asmussen2007stochastic"></dt-cite> to name a few.
</p>
<p>Splitting methods have also made their way in mathematical optimization. From solving non-smooth problems<dt-cite key="combettes2011proximal"></dt-cite> to parallel computing,<dt-cite key="boyd2011distributed"></dt-cite> some of the most effective methods are based on splitting. A method that has recently caught my eye is the Davis-Yin three operator splitting,<dt-cite key="davis2017three"></dt-cite> which is the subject of this post.</p>
<p class="framed">
<b>Outline:</b><br />
<span style="margin-left: 20px"><a href="#introduction">Introduction</a></span><br />
<span style="margin-left: 20px"><a href="#analysis">Iteration Complexity Analysis</a></span><br />
<span style="margin-left: 20px"><a href="#convergence">Convergence Theory</a></span><br />
<span style="margin-left: 20px"><a href="#code">Code</a></span><br />
<span style="margin-left: 20px"><a href="#openquestions">Open Questions</a></span><br />
<span style="margin-left: 20px"><a href="#refs">References</a></span><br />
</p>
<h2 id="introduction">Introduction</h2>
<p>The Davis-Yin <b>three operator splitting</b> is an algorithm that can solve optimization problems composed of a sum of three terms and was proposed by <a href="https://people.orie.cornell.edu/dsd95/">Damek Davis</a> and <a href="http://www.math.ucla.edu/~wotaoyin/">Wotao Yin</a>. The method can be seen as a slight generalization of previous methods like the Forward-Douglas-Rachford<dt-cite key="briceno2015forward"></dt-cite> and the Generalized Forward-Backward,<dt-cite key="raguet2013generalized"></dt-cite> a fact that for some was not appropriately acknowledged.<dt-cite key="raguet2018"></dt-cite> Despite this controversy, the method remains in my opinion one of the most exciting recent developments in optimization because of its excellent empirical performance, elegant formulation and its ease of use.</p>
<p>The three operator splitting can solve optimization problems of the form</p>
<p style="background-color: #D2E4FC; padding: 1px; border-radius: 8px;">
\begin{equation}\label{eq:opt_objective}\tag{OPT}
\minimize_{\boldsymbol{x} \in \RR^d} f(\xx) + g(\xx) + h(\xx) ~,
\end{equation}
</p>
<p>where we have access to the gradient of $f$ and the proximal operator of $g$ and $h$.<dt-note key="prox">The proximal operator is a crucial element of many splitting methods. For a function $\varphi: \RR^d \to \RR$ and step-size $\gamma > 0$, it is denoted $\prox_{\gamma \varphi}$ and defined as
$$
\prox_{\gamma \varphi}(\xx) \defas \argmin_{\zz \in \RR^d}\Big\{ \varphi(\zz) + \frac{1}{2 \gamma}\|\xx - \zz\|^2\Big\}~.
$$ Despite its simplicity, the proximal operator is at the core of many optimization methods. The following monograph provides a gentle introduction to the topic: Parikh, Neal, and Stephen Boyd. <a href="http://dx.doi.org/10.1561/2400000003">"Proximal algorithms."</a> <i>Foundations and Trends in Optimization</i>, 2014.
</dt-note> This formulation can express a broad range of problems arising in machine learning and signal processing: $f$ can be any smooth loss function such as the logistic or least squares loss, while the two proximal terms (which can be generalized to an arbitrary number of them) include many important penalties like the group lasso with overlap, total variation, and $\ell_1$ trend filtering. Furthermore, the penalties can be extended-valued, which allows enforcing an intersection of convex constraints through the use of indicator functions. </p>
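<p>To make the side note on proximal operators concrete: for the $\ell_1$ norm the prox has the well-known soft-thresholding closed form. A minimal sketch:</p>

```python
import numpy as np

def prox_l1(x, gamma):
    """Soft-thresholding: the closed-form prox of gamma * ||.||_1.

    Each coordinate is shrunk toward zero by gamma and clipped at zero.
    """
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

shrunk = prox_l1(np.array([2.0, -0.3, 0.7]), 0.5)   # [1.5, 0.0, 0.2]
```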
<p>In its basic form, the algorithm takes as input an initial guess $\yy_0 \in \RR^d$ and generates a sequence of iterates $\{\yy_1, \yy_2, \ldots\}$ through the formula
\begin{align}
\zz_t &= \prox_{\gamma h}(\yy_t)~\label{eq:z_update}\\
\xx_t &= \prox_{\gamma g}(2 \zz_t - \yy_t - \gamma \nabla f(\zz_t))\label{eq:x_update}\\
\yy_{t+1} &= \yy_t - \zz_t + \xx_t~.\label{eq:y_update}
\end{align}
</p>
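<p>A direct transcription of these three updates into Python might look as follows (the nonnegative lasso example below is my own choice, picked because its solution has a closed form to compare against):</p>

```python
import numpy as np

def davis_yin(grad_f, prox_g, prox_h, y0, gamma, max_iter=500):
    """Three operator splitting: a direct transcription of the update equations."""
    y = y0.copy()
    for _ in range(max_iter):
        z = prox_h(y, gamma)                               # z_t
        x = prox_g(2 * z - y - gamma * grad_f(z), gamma)   # x_t
        y = y + x - z                                      # y_{t+1}
    return prox_h(y, gamma)                                # z converges to the solution

# Example: minimize 0.5 ||x - b||^2 + lam ||x||_1  subject to  x >= 0.
# On the nonnegative orthant the l1 term is linear, so the solution
# is max(b - lam, 0) coordinate-wise, which we can compare against.
b = np.array([2.0, 0.2, -1.0, 0.7])
lam = 0.5
grad_f = lambda z: z - b                                   # f(x) = 0.5 ||x - b||^2, L = 1
prox_g = lambda v, g: np.maximum(v, 0.0)                   # prox of the indicator of x >= 0
prox_h = lambda v, g: np.sign(v) * np.maximum(np.abs(v) - g * lam, 0.0)
sol = davis_yin(grad_f, prox_g, prox_h, np.zeros(4), gamma=1.0)
```

<p>With $\gamma = 1/L = 1$ the iterates reach the closed-form solution quickly on this separable toy problem.</p>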
<p>Each iteration requires to evaluate the proximal operator of $h$ and $g$, as well as the gradient of $f$. It depends on a single step-size parameter $\gamma$, which is a practical advantage with respect to similar methods like Condat-Vu<dt-cite key="condat2013primal"></dt-cite> <dt-cite key="vu2013splitting"></dt-cite> that depend on two step-size parameters.<label for="notecondatvu" class="margin-toggle sidenote-number"></label><input type="checkbox" id="notecondatvu" class="margin-toggle" /><span class="sidenote">Considering the three operator splitting and the Condat-Vu algorithm in the same bag is not entirely fair, as the latter can minimize a larger class of objectives of the form \begin{equation}
f(\xx) + g(\xx) + h(\boldsymbol{K}\xx)\,,
\end{equation} where $\boldsymbol{K}$ is an arbitrary linear operator. Compared to \eqref{eq:opt_objective}, this allows for an extra linear operator inside $h$, which is set to the identity in the three operator splitting. A recently proposed variant of the three operator splitting (<a href="https://doi.org/10.1007/s10915-018-0680-3">"A New Primal–Dual Algorithm for Minimizing the Sum of Three Functions with a Linear Operator"</a>, by Ming Yan) allows, as the Condat-Vu algorithm, a linear operator in one of the proximal terms. </span></p>
<p>The three operator splitting is strongly related to other splitting methods.
For example, when $h$ is constant, it defaults to the proximal gradient method. This is easy to verify since in this case the first step becomes $\zz_t = \prox_{\gamma h}(\yy_t) = \yy_t$, and replacing in the next lines gives $\yy_{t+1} = \prox_{\gamma g}(\yy_t - \gamma \nabla f(\yy_t))$, which corresponds to an iteration of the proximal-gradient method. Similarly, it is trivial to verify that whenever $f$ is constant then it defaults to the <a href="http://www.seas.ucla.edu/~vandenbe/236C/lectures/dr.pdf">Douglas-Rachford</a> method. A more detailed comparison with related methods can be found in (Raguet 2018).<dt-cite key="raguet2018"></dt-cite>
</p>
<h2 id="analysis">Iteration complexity analysis</h2>
<p>In this section I will give a simplified proof of the iteration complexity of the method for convex functions.</p>
<p><b>Assumptions</b>. I assume that $f$, $g$ and $h$ are convex, proper and <a href="https://en.wikipedia.org/wiki/Semi-continuity">lower semicontinuous</a> functions. In addition, $f$ is differentiable with Lipschitz gradient, where $L$ denotes its Lipschitz constant.<dt-note key="lipschitz">In other words, $f$ verifies
$$
\|\nabla f(\xx) - \nabla f(\yy)\|\leq L \|\xx - \yy\|
$$
for all $\xx, \yy$ in the domain. The class of functions that verify this property is sometimes called $L$-smooth functions.</dt-note></p>
<p>It is common to analyze optimization algorithms by proving that the objective function decreases at each iteration. However, this turns out to be problematic for the three operator splitting, as the value of the objective function can be infinity at any iterate. This happens for example when both $g$ and $h$ are the indicator function over a set. </p>
<p>Davis and Yin overcame this difficulty by assuming Lipschitz continuity of one of the proximal terms (Davis and Yin 2017, Theorem 3.1 and 3.2).<dt-cite key="davis2017three"></dt-cite> However, I find this not satisfactory because it leaves out important applications of the algorithm like optimization over an intersection of constraints.</p>
<p>A different approach was proposed in our recent paper,<dt-cite key="pedregosa18a"></dt-cite> <dt-note key="saddle_novelty">Although this approach seems to be new in the context of analyzing the three operator splitting, it is a classical approach for other primal-dual methods. See for instance the analysis of <a href="https://doi.org/10.1007/s10107-015-0957-3">(Chambolle and Pock, 2016)</a> for a different class of splitting methods. </dt-note> and consists in reformulating the original optimization as a saddle point problem. Let $h^*$ denotes the <a href="https://en.wikipedia.org/wiki/Convex_conjugate">convex conjugate</a> of $h$, then we have the following identities:
\begin{align}
&\min_{\xx \in \RR^d} f(\xx) + g(\xx) + h(\xx) \\
&\quad = \min_{\xx \in \RR^d} f(\xx) + g(\xx) + \max_{\uu \in \RR^d}\left\{\langle \xx, \uu \rangle - h^*(\uu)\right\}\\
&\quad= \min_{\xx \in \RR^d} \max_{\uu \in \RR^d} \underbrace{f(\xx) + g(\xx) + \langle \xx, \uu \rangle - h^*(\uu)}_{\defas \mathcal{L}(\xx, \uu)}~.
\end{align}
We have transformed the original minimization into the problem of finding a saddle point of $\mathcal{L}(\xx, \uu) = f(\xx) + g(\xx) + \langle \xx, \uu \rangle - h^*(\uu)$.
Formally, a saddle point of $\mathcal{L}$ is a pair $(\xx^\star, \uu^\star)$ such that the following is verified for any $(\xx, \uu)$ in the domain:<dt-cite key="hiriart2013convex"></dt-cite>
\begin{equation}\label{eq:saddle_point}
\mathcal{L}(\xx^\star\!, \uu) \leq \mathcal{L}(\xx, \uu^\star) ~.
\end{equation}
From this definition it follows that $\mathcal{L}(\xx^\star\!, \uu) - \mathcal{L}(\xx, \uu^\star)$ is non-positive for all $(\xx, \uu)$ precisely when $(\xx^\star\!, \uu^\star)$ is a saddle point, and so it is a meaningful convergence criterion.</p>
<p>An unexpected side effect of using the saddle point suboptimality is that it simplifies considerably the iteration complexity analysis. Below is a proof of the $\mathcal{O}(1/t)$ convergence rate on the saddle point suboptimality for the step-size $\gamma=\frac{1}{L}$. Later we will see that under some extra assumptions it is also possible to derive convergence rate on the objective suboptimality from this result.</p>
<p style="border: 1px black solid; padding: 20px"><b>Theorem 1 </b> (Ergodic convergence rate). Let $\yy_t, \xx_t, \zz_t$ denote the sequence iterates produced by the three operator splitting, let $\uu_t$ be defined as $\uu_t \defas \frac{1}{\gamma}(\yy_t - \zz_t)$ and let $\overline{\xx}_t \defas \frac{1}{t+1}\sum_{i=0}^t \xx_i, \overline{\uu}_t \defas \frac{1}{t+1}\sum_{i=0}^t \uu_i$ denote the ergodic (or average) sequence of iterates. Then for $\gamma=\frac{1}{L}$ we have the following suboptimality bound, valid for all $t \geq 0$, all $\xx, \uu$ in the domain of $\mathcal{L}$:
\begin{equation}
\mathcal{L}(\overline\xx_{t}, \uu) - \mathcal{L}(\xx, \overline\uu_{t}) \leq \frac{L\|\yy_0 - \xx - \gamma \uu\|^2}{2(t+1)}~.
\end{equation}
</p>
<div class="proof">
<p>The proof technique is the same as in our recent paper,<dt-cite key="pedregosa18a"></dt-cite> and consists, roughly speaking, in interpreting the three operator splitting as two consecutive applications of the proximal-gradient algorithm and then using known properties of this method. </p>
<p>The proof crucially relies on the following technical lemma that relates the output of a proximal-gradient step with its input and an arbitrary point in the domain:</p>
<p><b>Lemma 1 (three point inequality).</b> Let $\psi$ be convex and $L$-smooth and let $\eta$ be convex. If $\aa^+ = \prox_{\sigma \eta}(\aa - \sigma \nabla \psi(\aa))$, then for all $\bb$ in the domain of $\eta$ we have
\begin{equation}
\begin{aligned}
&\psi(\aa^+) + \eta(\aa^+) - \psi(\bb) - \eta(\bb) \\
&\qquad \leq \frac{1}{\sigma}\langle \aa - \aa^+, \aa^+ - \bb \rangle + \frac{L}{2}\|\aa^+ - \aa\|^2
\end{aligned}\label{eq:three_point_inequality}
\end{equation}
</p>
<div class="wrap-collabsible"> <input id="collapsible3" class="toggle" type="checkbox"> <label for="collapsible3" class="lbl-toggle" tabindex="0"><b>Show proof</b></label><div class="collapsible-content"><div class="content-inner"><div class="proof">
<p>By $L$-smoothness and convexity respectively we have the inequalities
\begin{aligned}
\psi(\aa^+) &\leq \psi(\aa) + \langle \nabla \psi(\aa), \aa^+ - \aa\rangle + \frac{L}{2}\|\aa^+ - \aa\|^2\\
\psi(\aa) &\leq \psi(\bb) + \langle \nabla \psi(\aa), \aa - \bb\rangle~.
\end{aligned}
Adding them together gives the inequality
\begin{equation}
\psi(\aa^+) \leq \psi(\bb) + \langle \nabla \psi(\aa), \aa^+ - \bb\rangle + \frac{L}{2}\|\aa^+ - \aa\|^2
\end{equation}
</p>
<p>By the definition of proximal operator $\aa^+$ is characterized by the first order optimality conditions
\begin{equation}
0 \in \frac{1}{\sigma}(\aa^+ - \aa) + \nabla\psi(\aa) + \partial \eta(\aa^+)
\end{equation}
from where $\frac{1}{\sigma}(\aa - \aa^+) - \nabla\psi(\aa)\in \partial \eta(\aa^+)$ and so by convexity we have
\begin{equation}
\eta(\aa^+) - \eta(\bb) \leq \langle \frac{1}{\sigma}(\aa - \aa^+) - \nabla \psi(\aa), \aa^+ - \bb \rangle
\end{equation}
</p>
<p>Finally, adding both we obtain the claimed bound
$$
\psi(\aa^+) - \psi(\bb) + \eta(\aa^+) - \eta(\bb) \leq \frac{1}{\sigma}\langle \aa - \aa^+, \aa^+ - \bb \rangle + \frac{L}{2}\|\aa^+ - \aa\|^2~.
$$
</p>
</div></div></div></div>
<br />
<p>We will start the proof of this theorem by rewriting the algorithm in an equivalent but more convenient form. Using Moreau's decomposition for the proximal operator<dt-note key="moreau">The Moreau decomposition relates the proximal operator of a function $\varphi$ and that of its convex conjugate $\varphi^*$ as $$\prox_{\gamma \varphi}(\yy)= \yy - \gamma \prox_{\varphi^*/\gamma}(\yy/\gamma)\,.$$</dt-note> and the definition of $\uu_t$ the original formulation of the algorithm in Eq. \eqref{eq:z_update}-\eqref{eq:y_update} is equivalent to:
\begin{align}
\uu_t &= \prox_{h^*/\gamma}(\uu_{t-1} +\xx_{t-1}/\gamma)\\
\zz_t &= \xx_{t-1} + \gamma (\uu_{t-1} - \uu_t)\label{eq:pd_z_update}\\
\xx_t &= \prox_{\gamma g}(\zz_t - \gamma (\nabla f(\zz_t) + \uu_t))\\
\end{align}
</span>
</p>
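<p>To make this iteration concrete, here is a minimal NumPy sketch of the primal-dual form above. This is an illustration, not library code: <code>grad_f</code>, <code>prox_g</code> and <code>prox_h_conj</code> are user-supplied callables, with <code>prox_g(v, t)</code> computing $\prox_{t g}(v)$.</p>

```python
import numpy as np

def tos_primal_dual(grad_f, prox_g, prox_h_conj, x0, gamma, n_iter=100):
    """Three operator splitting in the primal-dual form above.

    prox_g(v, t) computes prox_{t g}(v); prox_h_conj(v, t) computes
    prox_{t h^*}(v); gamma is the step size (1/L in the analysis).
    """
    x = x0.copy()
    u = np.zeros_like(x0)
    for _ in range(n_iter):
        # u_t = prox_{h^*/gamma}(u_{t-1} + x_{t-1} / gamma)
        u_next = prox_h_conj(u + x / gamma, 1.0 / gamma)
        # z_t = x_{t-1} + gamma (u_{t-1} - u_t)
        z = x + gamma * (u - u_next)
        u = u_next
        # x_t = prox_{gamma g}(z_t - gamma (grad f(z_t) + u_t))
        x = prox_g(z - gamma * (grad_f(z) + u), gamma)
    return x
```

<p>With $g = h = 0$, so that $\prox_{\gamma g}$ is the identity and $\prox_{h^*/\gamma}$ maps everything to zero, the iteration reduces to plain gradient descent on $f$.</p>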
<p>The next step is to use Lemma 1 twice to bound the saddle point suboptimality:
\begin{align}
\mathcal{L}(&\xx_t, \uu) - \mathcal{L}(\xx, \uu_t)\\
& = \overbrace{f(\xx_t) + \langle \xx_t - \xx, \uu_t\rangle - f(\xx)}^{\psi(\xx_t) - \psi(\xx)} + \overbrace{g(\xx_t) - g(\xx)}^{\eta(\xx_t) - \eta(\xx)} - \langle \xx_t, \uu_t - \uu\rangle - h^*(\uu) + h^*(\uu_t)\\
&\leq \frac{1}{\gamma}\langle \zz_t - \xx_t, \xx_t - \xx\rangle + \frac{L}{2}\|\xx_t - \zz_t\|^2 + \langle \xx_{t-1} - \xx_t, \uu_t - \uu\rangle\nonumber\\
&\qquad\qquad - \overbrace{\langle \xx_{t-1}, \uu_t - \uu\rangle}^{\psi(\uu_t) - \psi(\uu)} + \overbrace{ h^*(\uu_t) - h^*(\uu)}^{\eta(\uu_t) - \eta(\uu)}\\
&\leq \frac{1}{\gamma}\langle \zz_t - \xx_t, \xx_t - \xx\rangle + \frac{L}{2}\|\xx_t - \zz_t\|^2 + \langle \gamma(\uu_{t-1} - \uu_t) + \xx_{t-1} - \xx_{t}, \uu_t - \uu\rangle \\
&\stackrel{\eqref{eq:pd_z_update}}{=} \frac{1}{\gamma}\langle \zz_t - \xx_t, \xx_t + \gamma \uu_t - \xx - \gamma \uu\rangle + \frac{L}{2}\|\xx_t - \zz_t\|^2\\
&\stackrel{\eqref{eq:y_update}}{=} \frac{1}{\gamma}\langle \yy_{t} - \yy_{t+1}, \yy_{t+1} - \yy\rangle + \frac{L}{2}\|\yy_{t+1} - \yy_t\|^2~, \text{ with $\yy \defas \xx + \gamma \uu$}\\
&= \frac{1}{2\gamma}\|\yy_{t} - \yy\|^2 - \frac{1}{2\gamma}\|\yy_{t+1} - \yy\|^2 +\left( \frac{1}{2\gamma} - \frac{L}{2}\right) \|\yy_{t+1} - \yy_t\|^2~,
\end{align}
where in the first inequality we have used Lemma 1 with $\psi(\cdot) = f(\cdot) + \langle \cdot, \uu_t\rangle, \eta = g, \sigma = \gamma, \aa=\zz_t$ and in the second inequality we have used the same lemma but this time with $\psi(\cdot) = \langle \xx_{t-1}, \cdot\rangle$, $\eta = h^*$, $\sigma=1/\gamma, \aa=\uu_{t-1}$.
</p>
<p>Setting $\gamma=\frac{1}{L}$ makes the last term vanish, and summing this inequality from $i=0$ to $t$ we obtain
\begin{align}
\sum_{i=0}^t \mathcal{L}(\xx_i, \uu) - \mathcal{L}(\xx, \uu_i) &\leq \frac{L}{2}\|\yy_0 - \yy\|^2 - \frac{L}{2}\|\yy_{t+1} - \yy\|^2\\
& \leq \frac{L}{2}\|\yy_0 - \yy\|^2\label{eq:sum_suboptimality}
\end{align}
</p>
<p>Finally, $\mathcal{L}(\xx_i, \uu)$ is a convex function of $\xx_i$ and $- \mathcal{L}(\xx, \uu_i)$ is also a convex function of $\uu_i$. Applying <a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality#Finite_form">Jensen's inequality</a> to this sum gives</span>
\begin{align}
\sum_{i=0}^t \mathcal{L}(\xx_i, \uu) - \mathcal{L}(\xx, \uu_i) &= (t+1)\frac{1}{t+1}\sum_{i=0}^t \mathcal{L}(\xx_i, \uu) - \mathcal{L}(\xx, \uu_i)\\
&\geq (t+1)\left(\mathcal{L}(\overline{\xx}_t, \uu) - \mathcal{L}(\xx, \overline{\uu}_t)\right)~.
\end{align}
Combining this with \eqref{eq:sum_suboptimality} and dividing both sides by $(t+1)$ we obtain
\begin{equation}
\mathcal{L}(\overline{\xx}_t, \uu) - \mathcal{L}(\xx, \overline{\uu}_t) \leq \frac{L\|\yy_0 - \yy\|^2}{2(t+1)}~,
\end{equation}
which is the desired bound.
</p>
</div>
<p>A couple of remarks on this theorem. First, the bound is given in terms of saddle point suboptimality. This is a bound that holds <i>for all</i> $\xx, \uu$, which might seem strange as convergence bounds in convex optimization typically only involve the current and optimal iterate. As highlighted in (Gidel 2017),<dt-cite key="gidel17a"></dt-cite>
a suboptimality criterion that only involves the current and optimal iterate like $\mathcal{L}(\xx_t, \uu^\star) - \mathcal{L}(\xx^\star, \uu_t)$, where $(\xx^\star, \uu^\star)$ is a saddle point, is not meaningful for saddle point convex problems as it can be zero for a point that is <i>not</i> a saddle point. A bound that holds for all $(\xx, \uu)$ is crucial for a meaningful analysis and these bounds are not uncommon in the analysis of primal-dual methods.<dt-cite key="chambolle2016ergodic"></dt-cite> <dt-cite key="malitsky2018first"></dt-cite></p>
<p>Second, by definition of <a href="https://en.wikipedia.org/wiki/Convex_conjugate">convex conjugate</a>, maximizing $\mathcal{L}$ over $\uu$ we recover the primal objective: $\sup_\uu \mathcal{L}(\xx, \uu)$ $= f(\xx) + g(\xx) + \sup_\uu \left\{\langle \xx, \uu\rangle - h^*(\uu)\right\}$ $= f(\xx) + g(\xx) + h(\xx)$. Since the bound in the previous theorem is valid for all $\uu$ in the domain, we can maximize over this variable to obtain a bound in terms of the primal objective suboptimality. Unfortunately, the value of $\uu$ that maximizes the saddle point objective for a fixed $\xx$ is not necessarily bounded, rendering the bound meaningless. However, assuming $\beta_h$-Lipschitz continuity of $h$ and using known results in duality theory that relate Lipschitz continuity of a function to boundedness of the domain of its conjugate, we can obtain the following result:</span></p>
<p style="border: 1px black solid; padding: 20px"><b>Corollary 1</b>. Let $P(\xx) \defas f(\xx) + g(\xx) + h(\xx)$ denote the objective function and $\xx^\star$ any solution to \eqref{eq:opt_objective}. If $h$ is $\beta_h$-Lipschitz then for the step-size $\gamma=\frac{1}{L}$ we have
\begin{equation}
P(\overline\xx_t) - P(\xx^\star) \leq \frac{L\|\yy_0 - \xx^\star\|^2 + \beta_h^2/L}{t+1}
\end{equation}
</p>
<div class="wrap-collabsible"> <input id="collapsible4" class="toggle" type="checkbox"> <label for="collapsible4" class="lbl-toggle" tabindex="0"><b>Show proof</b></label><div class="collapsible-content"><div class="content-inner"><div class="proof">
<p>
Let $\widehat\uu_t \defas \argmax_{\uu} \mathcal{L}(\overline\xx_{t}, \uu)$ and $(\xx^\star, \uu^\star)$ be a saddle point of $\mathcal{L}$. Then $\mathcal{L}(\overline\xx_{t}, \widehat\uu_t) = P(\overline\xx_{t})$ and $\mathcal{L}(\xx^\star, \uu^\star) = P(\xx^\star)$ by definition of convex conjugate.
Using this and the previous theorem we can write the following set of inequalities
\begin{align}
&P(\overline\xx_{t}) - P(\xx^\star) = \mathcal{L}(\overline\xx_{t}, \widehat\uu_t) - \mathcal{L}(\xx^\star, \uu^\star)\\
&\leq \mathcal{L}(\overline\xx_{t}, \widehat\uu_t) - \mathcal{L}(\xx^\star, \widehat\uu_t)\\
&\quad \text{ (definition of saddle point, Eq. \eqref{eq:saddle_point} with $\xx=\xx^\star$) }\nonumber\\
&\leq \frac{L}{2 (t+1)} \|\yy_0 - \xx^\star - \gamma \widehat\uu_t\|^2\label{eq:primal_subopt_proof_bound}\\
&\quad\text{ (Theorem 1 with $\xx = \xx^\star, \uu = \widehat\uu_t$)}\nonumber
\end{align}
The $\beta_h$-Lipschitz assumption on $h$ implies that the norm of every element in the domain of $h^*$ is bounded by $\beta_h$, see e.g. Corollary 13.3.3 in Rockafellar 1970.<dt-cite key="rockafellar1970convex"></dt-cite> This way we bound
\begin{align}
\|\yy_0 - \xx^\star - \gamma\widehat\uu_t\|^2 &\leq 2 \|\yy_0 - \xx^\star\|^2 + 2 \gamma^2\| \widehat\uu_t\|^2\\
&\leq 2 \|\yy_0 - \xx^\star\|^2 + 2 \gamma^2 \beta_h^2~.
\end{align}
Plugging this bound into Eq. \eqref{eq:primal_subopt_proof_bound} and using $\gamma=\frac{1}{L}$ we obtain the claimed bound
\begin{equation}
P(\overline\xx_{t}) - P(\xx^\star) \leq \frac{L\|\yy_0 - \xx^\star\|^2 + \beta_h^2/L}{t+1}~.
\end{equation}
</p>
</div></div></div></div>
<p>It is possible to derive stronger convergence rates under stronger assumptions, like linear convergence under strong convexity of the smooth term and smoothness of one of the proximal terms. Proofs of this result can be found in (Davis and Yin, 2015)<dt-cite key="davis2015three"></dt-cite> and (Pedregosa and Gidel, 2018).<dt-cite key="pedregosa18a"></dt-cite></p>
<br />
<h2 id="code">Code</h2>
<p>I maintain a (very much work-in-progress) implementation of the algorithm in the <a href="http://openopt.github.io/copt/">C-OPT package</a>. <a href="http://openopt.github.io/copt/generated/copt.minimize_TOS.html#copt.minimize_TOS">Here</a> is the function reference and <a href="http://openopt.github.io/copt/auto_examples/plot_sparse_nuclear_norm.html">here</a> and <a href="http://openopt.github.io/copt/auto_examples/plot_sparse_nuclear_norm.html#sphx-glr-auto-examples-plot-sparse-nuclear-norm-py">here</a> are a couple of examples.</p>
<h2 id="openquestions">Open questions</h2>
<p>As of August 2018, there are still many open questions surrounding the three operator splitting. Here is a short list, biased towards my own interests. If you come from the future and some have been answered I would love to hear about it in the comments. </p>
<p>
<ul style="padding-left: 1em;">
<li><b>Non-convex objectives.</b> Given that proximal gradient descent converges to a stationary point even when the smooth term is non-convex, it seems reasonable to think that the three operator splitting exhibits a similar behavior. </li>
<li><b>Linear convergence under weaker assumptions</b>. In practice I have always observed an empirical linear convergence of the method on objectives with a strongly convex smooth term and non-smooth proximal terms, yet the strongest known results to date require smoothness of one of the proximal terms to obtain a linear convergence rate. <u><b>Update December 2018</b></u>: Ryu and coauthors<dt-cite key="ryu2018operator"></dt-cite> have given a negative answer to this conjecture. The authors have developed a framework for obtaining tight convergence rates for a large class of splitting methods that includes the three operator splitting. Their results show that it is impossible to obtain a linear convergence rate under assumptions of convexity of $g$ and $h$ and smoothness and strong convexity of $f$.</li>
<li><b>Practical acceleration</b>. Davis and Yin<dt-cite key="davis2017three"></dt-cite> propose an accelerated variant of the method which requires knowledge of the strong convexity parameter of the objective. However, this parameter is typically unknown, and so the question arises of whether it is possible to design an accelerated variant which (as FISTA) does not require knowledge of this parameter.</li>
</ul>
</p>
<hr />
<h3 id="refs">References</h3>
<figure class="fullwidth">
<div id="references">
</div>
</figure>
Notes on the Frank-Wolfe Algorithm, Part I (2018-03-21, Fabian Pedregosa, /blog/2018/notes-on-the-frank-wolfe-algorithm-part-i/)
<p>This blog post is the first in a series discussing different theoretical and practical aspects of the Frank-Wolfe algorithm.</p>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<!-- for highlighting -->
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/default.min.css">
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<div style="display: none">
$$
\def\xx{\boldsymbol x}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\dd{\boldsymbol d}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\diam}{{diam}}
\def\RR{\mathbb R}
$$
</div>
<p class="framed">
<b>Outline:</b><br />
<span style="margin-left: 20px"><a href="#intro">The Frank-Wolfe Algorithm</a></span><br />
<span style="margin-left: 20px"><a href="#example">Example: using Frank-Wolfe to solve a Lasso problem</a></span><br />
<span style="margin-left: 20px"><a href="#convergence">Convergence Theory</a></span><br />
</p>
<h2>The Frank-Wolfe Algorithm</h2>
<p id="intro">The Frank-Wolfe (FW)<label for="FW-ref" class="margin-toggle sidenote-number"></label><input type="checkbox" id="FW-ref" class="margin-toggle" /><span class="sidenote">Originally published as Frank, Marguerite, and Philip Wolfe. <a href="http://dx.doi.org/10.1002/nav.3800030109">"An algorithm for quadratic programming."</a> Naval Research Logistics (1956). See also Jaggi, Martin. <a href="http://proceedings.mlr.press/v28/jaggi13-supp.pdf">"Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization."</a> ICML 2013 for a more recent exposition.</span>
or conditional gradient algorithm
is one of the oldest methods for nonlinear constrained optimization and has seen an impressive revival in recent years due to its low memory requirement and projection-free iterations. It can solve problems of the form </p>
<p style="background-color: #D2E4FC; padding: 1px; border-radius: 8px;">\begin{equation}\label{eq:fw_objective}
\minimize_{\boldsymbol{x} \in \mathcal{C}} f(\boldsymbol{x}) ~,
\end{equation}</p>
<p>where $f$ is differentiable with $L$-Lipschitz gradient<label for="smooth" class="margin-toggle sidenote-number"></label><input type="checkbox" id="smooth" class="margin-toggle" /><span class="sidenote">This is a very standard assumption in optimization, which can be intuitively interpreted as meaning that the objective function must be "smooth", i.e., cannot have any kinks or discontinuities.</span> and the domain $\mathcal{C}$ is a convex and compact set.<label for="convex-set" class="margin-toggle sidenote-number"></label><input type="checkbox" id="convex-set" class="margin-toggle" /><span class="sidenote"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6b/Convex_polygon_illustration1.svg/1024px-Convex_polygon_illustration1.svg.png" width="80%" class="center" alt="convex set">A convex set is one for which any segment between two points lies within the set. While the FW algorithm does not require the objective function $f$ to be convex, it <i>does</i> require the domain to be a convex set.</span></p>
<p>Frank-Wolfe is a remarkably simple algorithm that given an initial guess $\boldsymbol{x}_0$ constructs a sequence of estimates $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots$ that converges towards a solution of the optimization problem. The algorithm is defined as follows:
</p>
<p class="framed">\begin{align}
&\textbf{Input}: \text{initial guess $\xx_0$, tolerance $\delta > 0$}\nonumber\\
& \textbf{For }t=0, 1, \ldots \textbf{ do } \\
&\quad\boldsymbol{s}_t \in \argmax_{\boldsymbol{s} \in \mathcal{C}} \langle -\nabla f(\boldsymbol{x}_t), \boldsymbol{s}\rangle\label{eq:lmo}\\
&\quad \boldsymbol{d}_t = \ss_t - \xx_t\\
&\quad g_t = \langle - \nabla f(\xx_t), \dd_t \rangle\\
&\quad \textbf{If } g_t < \delta: \\
&\quad\qquad\hfill\text{// exit if gap is below tolerance }\nonumber\\
&\quad\qquad\textbf{return } \xx_t\\
&\quad {\textbf{Variant 1}}: \text{set step size as} \nonumber\\
&\quad\qquad\gamma_t = \vphantom{\sum_i}\min\Big\{\frac{g_t}{L\|\dd_t\|^2}, 1 \Big\}\label{eq:step_size}\\
&\quad \textbf{Variant 2}: \text{set step size by line search}\nonumber\\
&\quad\qquad\gamma_t = \argmin_{\gamma \in [0, 1]} f(\xx_t + \gamma \boldsymbol{d}_t)\label{eq:line_search}\\
&\quad\boldsymbol{x}_{t+1} = \boldsymbol{x}_t + \gamma_t \boldsymbol{d}_t~.\label{eq:update_rule}\\
&\textbf{end For loop}\\
& \textbf{return } \xx_t
\end{align}
</p>
<p>Contrary to other constrained optimization algorithms like projected gradient descent, the Frank-Wolfe algorithm does not require access to a projection, which is why it is sometimes referred to as a projection-free algorithm. It instead relies on a routine that solves a linear problem over the domain (Eq. \eqref{eq:lmo}). This routine is commonly referred to as a <i>linear minimization oracle</i>.<label for="lmo-minmax" class="margin-toggle sidenote-number"></label><input type="checkbox" id="lmo-minmax" class="margin-toggle" /><span class="sidenote">We defined it as a maximization to emphasize its intuitive meaning as the element that correlates the most with the steepest descent (the negative gradient). The name comes from the fact that other references define it equivalently as the minimization $$\boldsymbol{s}_t \in \argmin_{\boldsymbol{s} \in \mathcal{C}} \langle \nabla f(\boldsymbol{x}_t), \boldsymbol{s}\rangle~.$$</span></p>
<p>The rest of the algorithm mostly concerns finding an appropriate step size to move along the direction $\dd_t = \ss_t - \xx_t$, where $\ss_t$ is the output of the linear minimization oracle. Among the many step size rules that the FW algorithm admits, we detail two as <b>Variant 1</b> and <b>Variant 2</b>. The first variant is cheap to compute and only relies on knowledge of (a lower bound on) the Lipschitz constant $L$. The second variant can make more progress per iteration but requires solving a 1-dimensional problem at each step. In some cases, such as when $f$ is a least squares loss and $\mathcal{C}$ is the $\ell_1$ norm ball, this 1-dimensional problem has a closed form solution, and Variant 2 should be preferred. In the general case, where no closed form solution exists, the first variant should be preferred.
</p>
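<p>For illustration, Variant 1 of the step size rule amounts to a one-liner. This is a sketch (the function name is ours): <code>g_t</code> and <code>d_t</code> denote the Frank-Wolfe gap and update direction defined in the pseudocode above, and <code>L</code> the Lipschitz constant.</p>

```python
import numpy as np

def step_size_variant1(g_t, d_t, L):
    # Variant 1: gamma_t = min(g_t / (L ||d_t||^2), 1)
    return min(g_t / (L * np.dot(d_t, d_t)), 1.0)
```

<p>The cap at 1 guarantees that the next iterate, a convex combination of $\xx_t$ and $\ss_t$, stays inside the feasible set.</p>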
<p>
</p>
<figure>
<label for="description_FW" class="margin-toggle">⊕</label><input id="description_FW" class="margin-toggle" type="checkbox"><span class="marginnote"><span style="text-decoration: underline">Side figure</span>: Frank-Wolfe on a toy 2-dimensional problem, in which the triangle is the domain $\mathcal{C}$ and the level curves represent values of the objective function $f$.
Starting from an initial guess $\xx_0 \in \mathcal{C}$, the Frank-Wolfe algorithm select $\ss_0$, the element in the set that is most correlated with the negative gradient $-\nabla f(\xx_0)$ (Eq \eqref{eq:lmo}). This is always an extremal element of the set and ensures that $\ss_0 - \xx_0$ is a descent direction. The next iterate $\xx_1$ is computed by moving the current iterate along the descent direction $\ss_0 - \xx_0$ by a step size $\gamma_0$. <br />Image adapted from <a href="https://twitter.com/gabrielpeyre/status/945210545166258176">Gabriel Peyre</a>, (<a href="https://github.com/mathematical-tours/mathematical-tours.github.io/blob/master/tweets-sources/codes/frank-wolfe/FrankWolfe.m">code</a>).</span>
<img src="/images/2018/FW_iterates.png" alt="Frank-Wolfe algorithm on a toy problem" class="center">
</figure>
<p>One can see the Frank-Wolfe algorithm as one that solves a potentially non-linear problem through a sequence of linear ones. The effectiveness of this approach is then tightly linked to the ability to quickly solve the linear subproblems. As it turns out, for a large class of problems, of which the $\ell_1$ or nuclear (also known as trace) norm ball are the most widely known examples, the linear subproblems have either a closed form solution or efficient algorithms exist.<label for="Jaggi2013" class="margin-toggle sidenote-number"></label><input type="checkbox" id="Jaggi2013" class="margin-toggle" /><span class="sidenote">For an extensive discussion of the cost of the linear minimization oracle, see Jaggi, Martin. <a href="http://proceedings.mlr.press/v28/jaggi13-supp.pdf">"Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization."</a> ICML 2013.</span>
Compared to a projection, the use of a linear minimization oracle has other important consequences. For example, the output of this linear minimization oracle is always a vertex of the domain<label for="vertex_domain" class="margin-toggle sidenote-number"></label><input type="checkbox" id="vertex_domain" class="margin-toggle" /><span class="sidenote">By the properties of <a href="https://en.wikipedia.org/wiki/Linear_programming#Optimal_vertices_(and_rays)_of_polyhedra">linear programs</a>, the optimal value is always attained on the boundary.</span> and so by the update rule \eqref{eq:update_rule} the iterate is expressed as a convex combination of vertices. This feature can be very advantageous in situations with a huge or even infinite number of features, such as architecture optimization in neural networks<label for="ping2016learning" class="margin-toggle sidenote-number"></label><input type="checkbox" id="ping2016learning" class="margin-toggle" /><span class="sidenote">Ping W, Liu Q, Ihler AT. <a href="http://papers.nips.cc/paper/6342-learning-infinite-rbms-with-frank-wolfe.pdf">"Learning infinite RBMs with Frank-Wolfe"</a>, Advances in Neural Information Processing Systems (2016).</span> or estimation of the infinite-dimensional sparse matrices arising in multi-output polynomial networks.<label for="blondel2017" class="margin-toggle sidenote-number"></label><input type="checkbox" id="blondel2017" class="margin-toggle" /><span class="sidenote">Blondel M, Niculae V, Otsuka T, Ueda N. <a href="http://papers.nips.cc/paper/6927-multi-output-polynomial-networks-and-factorization-machines.pdf">"Multi-output Polynomial Networks and Factorization Machines"</a>, Advances in Neural Information Processing Systems 2017.</span>
</p>
<p>There are other step size strategies that I did not mention. For example, the step size can also be set as $\gamma_t = 2/(t+2)$. This is an "oblivious" step size, in that it doesn't depend on any quantity arising from the optimization. As such, it does not perform competitively in practice with the other step size strategies, although it does achieve the same theoretical rate of convergence. Another option, developed by Demyanov and Rubinov<label for="demyanov1970" class="margin-toggle sidenote-number"></label><input type="checkbox" id="demyanov1970" class="margin-toggle" /><span class="sidenote">Demyanov, Vladimir and Rubinov, Aleksandr <a href="https://doi.org/10.1002/zamm.19730530723">"Approximate Methods in Optimization Problems"</a>. Elsevier (1970). This is an excellent book, but unfortunately it is impossible to find online.</span> and similar to Variant 1 is
\begin{equation}
\gamma_t = \min\Big\{\frac{g_t}{L\,\diam(\mathcal{C})^2}, 1 \Big\}~,\label{eq:step_size_diam}
\end{equation}
where $\diam$ denotes the diameter with respect to the euclidean norm.<label for="euclidean" class="margin-toggle sidenote-number"></label><input type="checkbox" id="euclidean" class="margin-toggle" /><span class="sidenote">It is possible to use a non-euclidean norm too, as long as the Lipschitz constant $L$ is computed with respect to the same norm. For simplicity we will stick to the euclidean norm.</span>
However, since we always have $\|\xx_t - \ss_t\|^2 \leq \diam(\mathcal{C})^2$ by definition of diameter, the step sizes provided by this variant are always smaller than those of Variant 1, and it typically gives worse empirical convergence. A further improvement on this step size consists of replacing the Lipschitz constant $L$ by a local estimate that can potentially be much smaller, allowing for larger step sizes. This approach is developed in our recent paper.
<label for="pedregosa2018stepsize" class="margin-toggle sidenote-number"></label><input type="checkbox" id="pedregosa2018stepsize" class="margin-toggle" /><span class="sidenote">Pedregosa, Fabian and Askari, Armin and Negiar, Geoffrey and Jaggi, Martin (2018) <a href="https://arxiv.org/pdf/1806.05123.pdf">"Step-Size Adaptivity in Projection-Free Optimization"</a>. <i>arXiv:1806.05123</i></span>
</p>
<p>Yet another step size strategy that has been proposed is based on the notion of <i>curvature constant</i>. The curvature constant $C_f$ is defined as
\begin{equation}
C_f = \sup_{\substack{\xx,\ss\in\mathcal{C},\gamma\in [0,1] \\ \yy=\xx+\gamma (\ss-\xx)}} \frac{2}{\gamma^2}(f(\yy) - f(\xx) - \langle \yy - \xx, \nabla f(\xx)\rangle)
\end{equation}
The curvature constant is closely related to our Lipschitz assumption on the gradient. In particular, by the definition above we always have $C_f \leq \diam(\mathcal{C})^2 L$, which given \eqref{eq:step_size_diam} suggests the following rule for the step size:
\begin{equation}
\gamma_t = \min\Big\{\frac{g_t}{C_f}, 1 \Big\}~.\label{eq:step_size_curvature}
\end{equation}
This step size was used for example by Lacoste-Julien 2016.<label for="lacoste-julien2016" class="margin-toggle sidenote-number"></label><input type="checkbox" id="lacoste-julien2016" class="margin-toggle" /><span class="sidenote">Lacoste-Julien, Simon. <a href="https://arxiv.org/pdf/1607.00345.pdf">"Convergence rate of Frank-Wolfe for non-convex objectives."</a> <i>arXiv preprint arXiv:1607.00345</i> (2016).</span> Note that all the results in this post are in terms of the Lipschitz constant $L$ but analogous results exist in terms of this curvature constant. The obtained rates using this curvature constant are typically tighter; however, they lead to less practical step sizes, since this constant is rarely known in practice.
</p>
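<p>To build some intuition for the inequality $C_f \leq \diam(\mathcal{C})^2 L$, here is a small numerical sketch that brute-forces the supremum on a toy one-dimensional quadratic, for which the bound happens to be tight. The problem and grid are chosen purely for illustration.</p>

```python
import numpy as np

# toy problem: f(x) = 0.5 x^2 on C = [-1, 1], so that L = 1 and diam(C) = 2
f = lambda x: 0.5 * x ** 2
grad = lambda x: x
L, diam = 1.0, 2.0

# brute-force the supremum in the definition of the curvature constant
C_f = 0.0
for x in np.linspace(-1, 1, 21):
    for s in np.linspace(-1, 1, 21):
        for gamma in np.linspace(0.05, 1, 20):
            y = x + gamma * (s - x)
            C_f = max(C_f, 2 / gamma ** 2 * (f(y) - f(x) - (y - x) * grad(x)))
```

<p>For this quadratic one finds $C_f = 4 = \diam(\mathcal{C})^2 L$; for functions with less curvature along directions towards the vertices, $C_f$ can be much smaller than $\diam(\mathcal{C})^2 L$.</p>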
<h2 id="example">Example: using Frank-Wolfe to solve a Lasso problem</h2>
<p>Some aspects of the algorithm will become clearer with a concrete example. Let's consider a least squares problem with an $\ell_1$ constraint, a problem known as <i>the Lasso</i>. Given a data matrix $\boldsymbol{A} \in \RR^{n \times p}$, a target variable $\boldsymbol{b} \in \RR^n$, and a regularization parameter $\alpha$, this is a problem of the form \eqref{eq:fw_objective} with
\begin{equation}
f(\xx) = \frac{1}{2}\|\boldsymbol{A}\xx - \boldsymbol{b}\|^2~,\quad \mathcal{C} = \{\xx : \|\xx\|_1\leq \alpha\}
\end{equation}
In this case, the domain is a polytope and its vertices are $\{\alpha e_1, -\alpha e_1, \alpha e_2, -\alpha e_2, \ldots, \alpha e_p, -\alpha e_p\}$,
where $e_i$ is the $i$-th element of the canonical basis, i.e., the vector that is zero everywhere except in the $i$-th coordinate, in which it equals one.<label for="L1Ball" class="margin-toggle sidenote-number"></label><input type="checkbox" id="L1Ball" class="margin-toggle" /><span class="sidenote">In a 2-dimensional space, the $\ell_1$ ball of radius $\alpha$ is the convex hull of its 4 vertices $\{\alpha e_1, -\alpha e_1, \alpha e_2, -\alpha e_2\}$, where $e_1 = (1, 0)$ and $e_2 = (0, 1)$ <img class="center" src="/images/2018/L1_ball.png" alt=""> Similarly, in a $p$-dimensional space, the $\ell_1$ ball is the convex hull of $\{\alpha e_1, -\alpha e_1, \alpha e_2, -\alpha e_2, \ldots, \alpha e_p, -\alpha e_p\}$ </span>
</p>
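<p>For this domain the linear minimization oracle has a closed form: it suffices to look at the coordinate of the gradient with largest absolute value and return the corresponding signed vertex. A minimal sketch (the function name is ours):</p>

```python
import numpy as np

def lmo_l1_ball(grad, alpha):
    """argmax of <-grad, s> over the l1 ball of radius alpha.

    The maximum is attained at the vertex -alpha * sign(grad_i) * e_i,
    where i is the coordinate of largest absolute gradient.
    """
    idx = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[idx] = -alpha * np.sign(grad[idx])
    return s
```

<p>The full implementation below exploits this structure further by storing only the index and magnitude of the selected vertex instead of a dense vector.</p>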
<p>
The Lipschitz constant of $\nabla f$ is easy to compute, but in this case we can do even better, since the line search (Variant 2) has a closed form solution. The objective function of the line search \eqref{eq:line_search} is of the form
\begin{equation}
f(\xx_t + \gamma \dd_t) = \frac{1}{2}\|\gamma \boldsymbol{A}(\ss_t - \xx_t) + \boldsymbol{A}\xx_t - \boldsymbol{b}\|^2~,
\end{equation}
and differentiating with respect to $\gamma$ we can easily verify that the following step size, clipped to the interval $[0, 1]$, solves the line search problem
\begin{equation}
\gamma_t = \frac{\boldsymbol{q}_t^T (\boldsymbol{b} - \boldsymbol{A} \xx_t)}{\|\boldsymbol{q}_t\|^2}~,
\end{equation}
with $\boldsymbol{q}_t = \boldsymbol{A}(\boldsymbol{s}_t - \xx_t)$. Below is an example in Python of the Frank-Wolfe algorithm in this case, applied to a synthetic dataset.
<span class="marginnote">
This simple implementation takes around 20 seconds to solve a 10,000 $\times$ 10,000 problem (although the emphasis of this implementation is on clarity and not speed) and produces the following output:
<img src="/images/2018/frank_wolfe_lasso_gap.png" alt="">
This shows the decrease in the Frank-Wolfe gap as a function of the number of iterations. A couple of comments: first, note that the gap is not monotonically decreasing. Second, although the decrease in this case seems (roughly) exponential, we can in general <i>not</i> guarantee an exponential (also known as linear) rate of decrease for the Frank-Wolfe algorithm. This will only be the case for other variants such as the Away-steps Frank-Wolfe that we will discuss in upcoming posts.
</span>
</p>
<figure>
<pre><code class="python">
import numpy as np
from scipy import sparse
# .. for plotting ..
import pylab as plt
# .. to generate a synthetic dataset ..
from sklearn import datasets

n_samples, n_features = 10000, 10000
A, b = datasets.make_regression(n_samples, n_features)


def FW(alpha, max_iter=100, tol=1e-8, callback=None):
    # .. initial estimate, could be any feasible point ..
    x_t = sparse.dok_matrix((n_features, 1))
    # .. some quantities can be precomputed ..
    Atb = A.T.dot(b)
    for it in range(max_iter):
        # .. compute gradient. Slightly more involved than usual because ..
        # .. of the use of sparse matrices ..
        Ax = x_t.T.dot(A.T).ravel()
        grad = (A.T.dot(Ax) - Atb)
        # .. the LMO results in a vector that is zero everywhere except for ..
        # .. a single index. Of this vector we only store its index and magnitude ..
        idx_oracle = np.argmax(np.abs(grad))
        mag_oracle = alpha * np.sign(-grad[idx_oracle])
        d_t = -x_t.copy()
        d_t[idx_oracle, 0] += mag_oracle
        g_t = - d_t.T.dot(grad).ravel()
        if g_t <= tol:
            break
        q_t = A[:, idx_oracle] * mag_oracle - Ax
        step_size = min(q_t.dot(b - Ax) / q_t.dot(q_t), 1.)
        x_t += step_size * d_t
        if callback is not None:
            callback(g_t)
    return x_t


# .. plot evolution of FW gap ..
trace = []


def callback(g_t):
    trace.append(g_t)


sol = FW(.5 * n_features, callback=callback)
plt.plot(np.asarray(trace) / trace[0], lw=3)
plt.yscale('log')
plt.xlabel('Number of iterations')
plt.ylabel('Relative FW gap')
plt.title('FW on a Lasso problem')
plt.xlim((0, 100))
plt.grid()
plt.show()

density = np.mean(sol.toarray().ravel() != 0)
print('Density of solution: %s%%' % (density * 100))
</code></pre>
</figure>
<h2 id="convergence">Convergence Theory</h2>
<p>The Frank-Wolfe algorithm converges under very mild assumptions. As we will see, not even convexity of the objective is necessary to obtain weak convergence guarantees. As before, I will assume without explicit mention that $f$ is differentiable with $L$-Lipschitz gradient and $\mathcal{C}$ is a convex and compact set.
</p>
<p>In this part I will present two main convergence results: one for general objectives and one for convex objectives. For simplicity I assume that the linear subproblems are solved exactly, but these proofs can easily be extended to consider approximate linear minimization oracles. These proofs can be found for example in <label for="pedregosa2018stepsize" class="margin-toggle sidenote-number"></label><input type="checkbox" id="pedregosa2018stepsize" class="margin-toggle" /><span class="sidenote">Pedregosa, Fabian et al. <a href="https://arxiv.org/pdf/1806.05123.pdf">"Step-size adaptivity in Projection-Free Optimization"</a> ArXiv:1806.05123 (2018).</span>.</p>
<p>The remainder of the section is structured as follows: I first introduce two key definitions and a technical lemma, and then prove the convergence results.</p>
<p>
<b>Definition 1: Stationary point</b>. We will say that $\xx^\star \in \mathcal{C}$ is a stationary point if
$$
\langle \nabla f(\xx^\star), \xx - \xx^\star \rangle \geq 0~\text{ for all $\xx \in \mathcal{C}$}~.$$
The intuitive meaning of this definition is that $\xx^\star$ is a stationary point if every feasible direction with origin at $\xx^\star$ is positively correlated with the gradient. Said otherwise, $\xx^\star$ is a stationary point if there are no feasible descent directions with origin at $\xx^\star$.
</p>
<figure>
<label for="mn-exports-imports" class="margin-toggle">⊕</label><input id="mn-exports-imports" class="margin-toggle" type="checkbox"><span class="marginnote"><span style="text-decoration: underline">Side figure</span>: $\xx^\star$ is a stationary point if there are no feasible descent directions with origin at $\xx^\star$.</span>
<img src="/images/2018/FW_optimality.png" alt="FW optimality conditions" class="center">
</figure>
<p>
<b>Definition 2: Frank-Wolfe gap</b>. We denote by $g_t$ the Frank-Wolfe gap, defined as
\begin{equation}
g_t = \langle \nabla f(\xx_t), \xx_t - \ss_t \rangle
\end{equation}
</p>
<p>Note that by the definition of $\ss_t$ in \eqref{eq:lmo} we always have $\langle \nabla f(\xx_t), \ss_t\rangle \leq \langle \nabla f(\xx_t), \xx_t\rangle$ and so the Frank-Wolfe gap is always non-negative, and zero only at a stationary point. This makes it a good criterion to measure distance to a stationary point, and in fact convergence results for general (i.e., potentially non-convex) objectives will be given in terms of this quantity.</p>
<p>When $f$ is convex we also have that the FW gap verifies
\begin{align}\label{eq:convexity_fw_gap}
g_t &= \max_{\ss \in \mathcal{C}}\langle \nabla f(\xx_t), \xx_t - \ss\rangle \\
&\geq \langle \nabla f(\xx_t), \xx_t - \xx^\star\rangle\\
& \geq f(\xx_t) - f(\xx^\star)
\end{align}
where the last inequality follows from the definition of convexity<label for="convexity" class="margin-toggle sidenote-number"></label><input type="checkbox" id="convexity" class="margin-toggle" /><span class="sidenote">A differentiable function is said to be convex if $f(\yy) \geq f(\xx) + \langle \nabla f(\xx), \yy - \xx\rangle$ for all $\xx, \yy$ in the domain.</span>
and so can be used as a function suboptimality certificate.</p>
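<p>To make these definitions concrete, here is a small numerical sketch (my addition, not part of the original derivation): for a quadratic objective over the probability simplex, the linear minimization oracle is solved by a vertex, and the resulting Frank-Wolfe gap is non-negative and upper-bounds the suboptimality as in \eqref{eq:convexity_fw_gap}. The problem instance and dimensions are arbitrary choices.</p>

```python
import numpy as np

# Quadratic objective f(x) = 0.5 * ||A x - b||^2 over the probability simplex.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def f(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

def grad(x):
    return A.T @ (A @ x - b)

def lmo(g):
    # Linear minimization oracle over the simplex: a minimizing vertex.
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0
    return s

x = np.ones(5) / 5          # feasible starting point
g_x = grad(x)
s = lmo(g_x)
gap = g_x @ (x - s)         # Frank-Wolfe gap at x

# The gap is non-negative, and by convexity upper-bounds f(x) - f(x*);
# we estimate f(x*) (from above) by running many Frank-Wolfe steps.
z = x.copy()
for t in range(2000):
    z = z + 2.0 / (t + 2) * (lmo(grad(z)) - z)

assert gap >= 0
assert gap >= f(x) - f(z) - 1e-8
```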
<p>The next lemma relates the objective function value at two consecutive iterates and will be key to prove convergence results, both for convex and non-convex objectives. Given its usefulness in the following I will name it "Key recursive inequality".</p>
<p id="key-recursive"><b>Lemma 1: Key recursive inequality</b>. Let $\{\xx_0, \xx_1, \ldots\}$ be the iterates produced by the Frank-Wolfe algorithm (in either variant). Then we have the following inequality, valid for any $\xi \in [0, 1]$:
\begin{equation}
f(\xx_{t+1}) \leq f(\xx_t) - \xi g_t + \frac{1}{2}\xi^2 L \diam(\mathcal{C})^2
\end{equation}
</p>
<div class="proof">
<p>
A consequence of the Lipschitz gradient assumption on $f$ is that we can upper bound the function $f$ at every point $\yy \in \mathcal{C}$ by the following quadratic:<label for="L-smooth-inequality" class="margin-toggle sidenote-number"></label><input type="checkbox" id="L-smooth-inequality" class="margin-toggle" /><span class="sidenote">The Lipschitz gradient assumption in fact implies the stronger inequality $|f(\yy) - f(\xx) - \langle \nabla f(\xx), \yy-\xx\rangle|$ $\leq$ $\frac{L}{2}\|\xx - \yy\|^2$, see e.g. Lemma 1.2.3 in Nesterov's <a href="http://www.springer.com/us/book/9781402075537">Introductory lectures on convex optimization</a>. This implies the existence of <i>both</i> a quadratic upper bound and lower bound at $f(\yy)$, while in the proof we only use the upper bound. Hence, the Lipschitz assumption could be relaxed to that of having a quadratic upper bound. </span>
\begin{equation}
f(\yy) \leq f(\xx) + \langle \nabla f(\xx), \yy - \xx\rangle + \frac{L}{2}\|\yy - \xx\|^2
\end{equation}
We can apply this inequality, valid for any $\xx$, $\yy$ in the domain, to the special case $\xx = \xx_t$, $\yy = (1 - \gamma)\xx_{t} + \gamma \ss_t$ with $\gamma \in [0, 1]$ so that $\yy$ remains in the domain, and so we have
\begin{align}
f((1 - \gamma)\xx_{t} + \gamma \ss_t) &\leq f(\xx_t) + \gamma\overbrace{\langle\nabla f(\xx_t), \ss_{t} - \xx_t \rangle}^{- g_t} \nonumber\\
&\qquad+ \frac{L \gamma^2}{2}\|\ss_t - \xx_t\|^2~.\label{eq:l_smooth_xt}
\end{align}
We will now minimize the right hand side with respect to $\gamma \in [0, 1]$. This is a quadratic function of $\gamma$, and its minimum, which we denote $\gamma_t^\star$, is given by
\begin{equation}
\gamma_t^\star = \min\Big\{\frac{ g_t}{L\|\xx_t - \ss_t\|^2}, 1 \Big\}~.
\end{equation}
We now use the value $\gamma=\gamma_t^\star$ in the inequality \eqref{eq:l_smooth_xt} to get the following sequence of inequalities:
\begin{align}
&f((1 - \gamma_t^\star)\xx_{t} + \gamma_t^\star \ss_t) \\
&\leq f(\xx_t) - \gamma_t^\star g_t + \frac{L (\gamma_t^\star)^2}{2}\|\ss_t - \xx_t\|^2\\
&= f(\xx_t) + \min_{\xi \in [0, 1]}\left\{-\xi g_t + \frac{L \xi^2}{2}\|\ss_t - \xx_t\|^2\right\}\\
&\qquad \text{ (by optimality of $\gamma_t^\star$)}\nonumber\\
&\leq f(\xx_t) - \xi g_t + \frac{L \xi^2}{2}\|\ss_t - \xx_t\|^2\quad \text{ (for any $\xi \in [0, 1]$)}\nonumber\\
&\leq f(\xx_t) - \xi g_t + \frac{L \xi^2}{2}\diam(\mathcal{C})^2\label{eq:recusive_rhs_final}~.
\end{align}
The right hand side of the above inequality already contains the terms claimed in the Lemma; it remains to relate the left hand side to $f(\xx_{t+1})$. For Variant 1 of the algorithm we have $f(\xx_{t+1}) = f((1 - \gamma_t^\star)\xx_{t} + \gamma_t^\star \ss_t)$, since $\gamma_t$ and $\gamma_t^\star$ coincide in this case. For Variant 2 we have $f(\xx_{t+1}) \leq f((1 - \gamma_t^\star)\xx_{t} + \gamma_t^\star \ss_t)$, since by definition of line search $\xx_{t+1}$ minimizes the objective value over the segment $\{(1 - \gamma)\xx_{t} + \gamma \ss_t : \gamma \in [0, 1]\}$. Hence, in either case we have
\begin{equation}
f(\xx_{t+1}) \leq f((1 - \gamma_t^\star)\xx_{t} + \gamma_t^\star \ss_t)
\end{equation}
Chaining this last inequality with Eq. \eqref{eq:recusive_rhs_final} yields the claimed inequality.
</p>
</div>
<p>
The following is our first convergence rate result and is valid for objectives with $L$-Lipschitz gradient but not necessarily convex. This was first proven by Simon Lacoste-Julien:<label for="lacoste-julien2016" class="margin-toggle sidenote-number"></label><input type="checkbox" id="lacoste-julien2016" class="margin-toggle" /><span class="sidenote">Lacoste-Julien, Simon. <a href="https://arxiv.org/pdf/1607.00345.pdf">"Convergence rate of Frank-Wolfe for non-convex objectives."</a> <i>arXiv preprint arXiv:1607.00345</i> (2016). </span>
</p>
<p class="framed"><b>Theorem 1: Convergence rate for general objectives</b>. If $f$ is differentiable with $L$-Lipschitz gradient, then we have the following $\mathcal{O}(1/\sqrt{t})$ bound on the best Frank-Wolfe gap:
\begin{equation}
\min_{0 \leq i\leq t} g_i \leq \frac{\max\{2 h_0, L \diam(\mathcal{C})^2\}}{\sqrt{t+1}}~,
\end{equation}
where $h_0 = f(\xx_0) - \min_{\xx \in \mathcal{C}} f(\xx)$ is the initial global suboptimality.
</p>
<div class="proof">
<p><label for="proof-lacoste-julien2016" class="margin-toggle sidenote-number"></label><input type="checkbox" id="proof-lacoste-julien2016" class="margin-toggle" /><span class="sidenote">This proof roughly follows that of <a href="https://arxiv.org/pdf/1607.00345.pdf">Lacoste-Julien (2016)</a>, with minor differences on how the case $g_t \geq C$ is handled and with a slightly different step size rule: while I consider the step sizes of Variant 1 and 2, Lacoste-Julien considers step sizes of the form of Variant 2 and \eqref{eq:step_size_curvature} </span>
By Lemma 1 we have the following sequence of inequalities, valid for any $\xi \in [0, 1]$:
\begin{align}
f(\xx_{t+1})&\leq f(\xx_t) - \xi g_t + \frac{\xi^2 L }{2}\diam(\mathcal{C})^2\\
&\leq f(\xx_t) - \xi g_t + \frac{\xi^2 C }{2}~,
\end{align}
with $C = L \diam(\mathcal{C})^2$. We consider the value of $\xi$ that minimizes the right hand side and we obtain $\xi^* = \min\{g_t/C, 1\}$.
We will now make a distinction of cases based on the value of $\xi^*$:
</p>
<ul>
<li>If $g_t \leq C$, then $\xi^* = g_t / C$ and using this value in the previous inequality we obtain the bound
\begin{equation}
f(\xx_{t+1}) \leq f(\xx_t) - \frac{g_t^2}{2 C}
\end{equation}
</li>
<li>If $g_t > C$, then $\xi^* = 1$ and we have the following sequence of inequalities
\begin{align}
f(\xx_{t+1}) &\leq f(\xx_t) - g_t + \frac{C}{2}\\
&\leq f(\xx_t) - \frac{g_t}{2} \quad \text{ (using $C \leq g_t$)}
\end{align}
</li>
</ul>
<p>Combining both cases we have
\begin{equation}
f(\xx_{t+1}) \leq f(\xx_t) - \frac{g_t}{2}\min\left\{\frac{g_t}{C}, 1\right\}
\end{equation}
Summing the previous inequality from iteration $0$ to $t$ and rearranging we have
\begin{align}
-h_0 \leq f(\xx_{t+1}) - f(\xx_0) &\leq - \sum_{i=0}^t \frac{g_i}{2}\min\left\{\frac{g_i}{C}, 1\right\} \\
&\leq - (t+1) \frac{g^*_t}{2}\min\left\{\frac{g^*_t}{C}, 1\right\}~,\label{eq:ineq_optim_gt}
\end{align}
where $g_t^* = \min_{0\leq i\leq t} g_i$. Again, we make a distinction of cases, this time on $g_t^*$:</p>
<ul>
<li>If $g_t^* \leq C$, then $\min\{g^*_t/C, 1\} = g^*_t/C$ and solving for $g_t^*$ in the previous inequality we have
\begin{align}
g_t^* &\leq \sqrt{\frac{2 C h_0}{t+1}} \leq \frac{2 h_0 + C}{2\sqrt{t+1}} \\
& \leq \frac{\max\{2 h_0, C\}}{\sqrt{t+1}}
\end{align}
where in the second inequality we have used Young's inequality $ab \leq \frac{a^2}{2} + \frac{b^2}{2}$ with $a = \sqrt{2 h_0}, b = \sqrt{C}$.
</li>
<li>If $g_t^* > C$, then $\min\{g^*_t/C, 1\} = 1$ and rearranging \eqref{eq:ineq_optim_gt} we have the following inequality with the stronger $\mathcal{O}(1/t)$ rate, which we can trivially bound by an $\mathcal{O}(1/\sqrt{t})$ bound:
\begin{equation}
g_t^* \leq \frac{2 h_0}{t+1} \leq \frac{2 h_0}{\sqrt{t+1}} \leq \frac{\max\{2 h_0, C\}}{\sqrt{t+1}}
\end{equation}
</li>
</ul>
<p>
Hence, in both cases we have
\begin{equation}
g_t^* \leq \frac{\max\{2 h_0, C\}}{\sqrt{t+1}}~,
\end{equation}
and the claimed bound follows from the definition of $g_t^*$.
</p>
</div>
<p class="framed" id="theorem2"><b>Theorem 2: Convergence rate for convex objectives</b>. If $f$ is convex and differentiable with $L$-Lipschitz gradient, then we have the following convergence rate for the function suboptimality:
\begin{equation}
f(\xx_t) - f(\xx^\star) \leq \frac{2 L \diam(\mathcal{C})^2}{t+1}
\end{equation}
</p>
<div class="proof">
<p><label for="proof-wilson" class="margin-toggle sidenote-number"></label><input type="checkbox" id="proof-wilson" class="margin-toggle" /><span class="sidenote">This proof uses the same proof technique as that of Nesterov Y. <a href="https://link.springer.com/article/10.1007/s10107-017-1188-6">Complexity bounds for primal-dual methods minimizing the model of objective function</a> (2015). A proof with a similar convergence rate but different proof techniques can be found in other papers, such as Martin Jaggi's <a href="http://proceedings.mlr.press/v28/jaggi13-supp.pdf">Revisiting Frank-Wolfe</a> or Francesco Locatello's <a href="http://proceedings.mlr.press/v54/locatello17a/locatello17a.pdf">A Unified Optimization View on Generalized Matching Pursuit and Frank-Wolfe</a>.</span>
Because of convexity we can obtain a tighter bound using the following simple inequality, mentioned earlier \eqref{eq:convexity_fw_gap}:
\begin{equation}\label{eq:convex_inequality}
f(\xx) - f(\xx^\star) \leq \langle\nabla f(\xx), \xx - \xx^\star\rangle~.
\end{equation}
Let $e_t = A_t (f(\xx_t) - f(\xx^\star))$ for a positive $A_t$ that we will fix later, and let $C = L \diam(\mathcal{C})^2$. Then we have the following sequence of inequalities
\begin{align}
&e_{t+1} - e_t\nonumber\\
&\quad= A_{t+1}(f(\xx_{t+1}) - f(\xx^\star)) - A_t(f(\xx_t) - f(\xx^\star))\\
&\quad\leq A_{t+1}(f(\xx_{t}) - \xi g_t + \frac{\xi^2 C}{2} - f(\xx^\star)) - A_t(f(\xx_t) - f(\xx^\star))\\
&\quad\qquad \text{ (by Lemma 1, for any $\xi \in [0, 1]$)}\nonumber\\
&\quad\leq A_{t+1}(f(\xx_{t}) - f(\xx^\star) - \xi (f(\xx_t) - f(\xx^\star)) + \frac{\xi^2 C}{2})\nonumber\\
&\qquad - A_t(f(\xx_t) - f(\xx^\star))\qquad \text{ (by convexity \eqref{eq:convex_inequality})}\\
&\quad= ((1 - \xi) A_{t+1} - A_t) (f(\xx_{t}) - f(\xx^\star)) + A_{t+1}\frac{\xi^2 C}{2}\label{eq:convex_subopt_1}
\end{align}
Now, choosing $A_t = \frac{t(t+1)}{2}$, $\xi = 2/(t+2)$ we have:
\begin{align}
(1 - \xi) A_{t+1} - A_t &= \frac{t}{t+2}\cdot\frac{(t+1)(t+2)}{2} - \frac{t(t+1)}{2} = 0\\
A_{t+1}\frac{\xi^2}{2} &= \frac{t+1}{t+2} \leq 1~,
\end{align}
and so replacing with these values of $A_t$ and $\xi$ in Eq. \eqref{eq:convex_subopt_1} gives
\begin{align}
e_{t+1} - e_t \leq C~.
\end{align}
Summing this inequality from $0$ to $t-1$ and using $e_0=0$, we have for $t > 0$:
\begin{equation}
e_{t} \leq t C \implies f(\xx_{t}) - f(\xx^\star) \leq \frac{2 C}{t+1}~,
\end{equation}
where in the last implication we have divided by $A_t = t(t+1)/2$, and so we need $t > 0$. The claimed bound now follows from the definition of $C$.
</p>
</div>
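<p>As a numerical sanity check of Theorem 2 (a sketch of mine, not part of the original text), the snippet below runs Frank-Wolfe with the step size $\gamma_t = 2/(t+2)$ that appears in the proof (the same descent inequality holds for this fixed step by $L$-smoothness) on a simplex-constrained least squares problem, and verifies that the suboptimality stays below $2 L \diam(\mathcal{C})^2/(t+1)$. Here $L$ is the largest eigenvalue of $A^\top A$ and the Euclidean diameter of the probability simplex is $\sqrt{2}$.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

def lmo(g):
    # vertex of the simplex minimizing <g, s>
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0
    return s

L = np.linalg.eigvalsh(A.T @ A).max()  # Lipschitz constant of the gradient
C = L * 2.0                            # L * diam(simplex)^2, with diam = sqrt(2)

x = np.ones(8) / 8
fs = []
for t in range(500):
    x = x + 2.0 / (t + 2) * (lmo(grad(x)) - x)
    fs.append(f(x))

# Over-estimate f(x*) by simply running the method much longer; since the
# estimate is >= f(x*), using it only makes the check below more stringent-safe.
z = x.copy()
for t in range(500, 20000):
    z = z + 2.0 / (t + 2) * (lmo(grad(z)) - z)
f_star = f(z)

# Theorem 2: f(x_t) - f(x*) <= 2 C / (t + 1), where x_t is the t-th iterate.
for t, ft in enumerate(fs, start=1):
    assert ft - f_star <= 2 * C / (t + 1) + 1e-8
```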
<h3>Next posts</h3>
<p>This was the first post in a series dedicated to the Frank-Wolfe algorithm. In <a href="/blog/2018/fw2/">part 2</a>, I strengthen Theorem 2 with primal-dual guarantees, and in upcoming posts I will discuss <a href="/blog/2022/adaptive_fw/">step-size strategies</a>.</p>
<hr />
<h3>References</h3>
<figure class="fullwidth">
<ul style="width: 100%">
<li>Bach, Francis. <a href="https://arxiv.org/pdf/1211.6302.pdf">"Duality between subgradient and conditional gradient methods."</a> SIAM Journal on Optimization (2015).</li>
<li>Jaggi M. <a href="http://proceedings.mlr.press/v28/jaggi13-supp.pdf">"Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization"</a>. ICML (2013).</li>
<li>Lacoste-Julien S, Jaggi M. <a href="http://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">"On the global linear convergence of Frank-Wolfe optimization variants"</a>. In Advances in Neural Information Processing Systems (2015).</li>
<li>Lacoste-Julien, Simon. <a href="https://arxiv.org/pdf/1607.00345.pdf">"Convergence rate of Frank-Wolfe for non-convex objectives."</a> arXiv preprint arXiv:1607.00345 (2016).</li>
<li>Locatello, F., Khanna, R., Tschannen, M. and Jaggi, M. <a href="http://proceedings.mlr.press/v54/locatello17a.html">"A Unified Optimization View on Generalized Matching Pursuit and Frank-Wolfe". </a>, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (2017)</li>
<li>Nesterov, Yu. <a href="https://doi.org/10.1007/s10107-017-1188-6">"Complexity bounds for primal-dual methods minimizing the model of objective function."</a> Mathematical Programming (2017).</li>
<li>Pedregosa, Fabian and Askari, Armin and Negiar, Geoffrey and Jaggi, Martin (2018) <a href="https://arxiv.org/pdf/1806.05123.pdf">"Step-Size Adaptivity in Projection-Free Optimization"</a>. <i>arXiv:1806.05123</i></li>
<li><a href="https://youtu.be/24e08AX9Eww">Historical perspective on the Frank-Wolfe algorithm. </a>Marguerite Frank gives a beautiful historical perspective on the algorithm at the 2013 NIPS workshop <a href="https://sites.google.com/site/nips13greedyfrankwolfe/">"Greedy Algorithms, Frank-Wolfe and Friends"</a></li>
</ul>
</figure>
Optimization inequalities cheatsheet2017-01-11T00:00:00+01:002017-01-11T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2017-01-11:/blog/2017/optimization-inequalities-cheatsheet/
<p>Most proofs in optimization consist in using inequalities for a particular function class in some creative way.
This is a cheatsheet with inequalities that I use most often. It considers class of functions that are convex,
strongly convex and $L$-smooth.
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX …</script>
<p>Most proofs in optimization consist in using inequalities for a particular function class in some creative way.
This is a cheatsheet with inequalities that I use most often. It considers class of functions that are convex,
strongly convex and $L$-smooth.
</p>
<script type="text/javascript" src="/theme/js/bibtexParse.js">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
},
});
</script>
<script type="text/javascript" async src="/node_modules/mathjax2/MathJax.js?config=TeX-AMS_CHTML">
</script>
<script type="text/javascript" src="/theme/js/refs_v1.js"></script>
<script type="text/javascript">
const bibtex = ``;
document.addEventListener('DOMContentLoaded', doReferences, false);
document.addEventListener('DOMContentLoaded', doTOC, false);
</script>
<div id="TOC"></div>
<p>
<b>Setting</b>. $f$ is a function $\mathbb{R}^p \to \mathbb{R}$. Below is a set of inequalities that hold when $f$
belongs to a particular class of functions, where $x, y, z \in \mathbb{R}^p$ are arbitrary elements in its domain. For
simplicity I assume that functions are differentiable, but most of these remain true when the gradient is replaced
by a subgradient.
</p>
<h2>$f$ is $L$-smooth</h2>
<p>
This is the class of functions that are differentiable with Lipschitz continuous gradient.
<ol type="a">
<li>$\|\nabla f(y) - \nabla f(x) \| \leq L \|x - y\|$</li>
<li>$|f(y) - f(x) - \langle \nabla f(x), y - x\rangle| \leq \frac{L}{2}\|y - x\|^2$</li>
<li>$\nabla^2 f(x) \preceq L\qquad \text{ (assuming $f$ is twice differentiable)}
$</li>
</ol>
</p>
<div class="wrap-collabsible"> <input id="collapsible1" class="toggle" type="checkbox"> <label for="collapsible1" class="lbl-toggle" tabindex="0"><b>Show proof</b></label>
<div class="collapsible-content">
<div class="content-inner">
<div class="proof" id="proof-smooth">
<ol type="a">
<li>is the definition of gradient Lipschitz.</li>
<li>follows from Taylor's theorem with remainder.</li>
<li> follows from
using $y = x + \varepsilon s$ in a and taking the limit $\varepsilon \to 0$.</li>
</ol>
</div>
</div>
</div>
</div>
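<p>A quick numerical check of inequalities 1.a and 1.b (my illustration, not part of the cheatsheet): the function $f(x) = \log(1 + e^x)$ is $L$-smooth with $L = 1/4$, since $f''(x) = \sigma(x)(1 - \sigma(x)) \leq 1/4$, where $\sigma$ is the sigmoid.</p>

```python
import numpy as np

# f(x) = log(1 + exp(x)) has f''(x) = sigmoid(x) * (1 - sigmoid(x)) <= 1/4,
# so its gradient is Lipschitz continuous with constant L = 1/4.
f = lambda x: np.log1p(np.exp(x))
fprime = lambda x: 1.0 / (1.0 + np.exp(-x))
L = 0.25

rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.uniform(-5, 5, size=2)
    # inequality 1.a: the gradient is L-Lipschitz
    assert abs(fprime(y) - fprime(x)) <= L * abs(y - x) + 1e-12
    # inequality 1.b: quadratic upper and lower bounds
    assert abs(f(y) - f(x) - fprime(x) * (y - x)) <= L / 2 * (y - x) ** 2 + 1e-12
```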
<h2>$f$ is convex</h2>
<ol type="a">
<li>$f(\lambda x + (1 - \lambda)y) \leq \lambda f(x) + (1 - \lambda)f(y)$ for all $\lambda \in [0, 1]$.</li>
<li>$f(x) \leq f(y) + \langle \nabla f(x), x - y\rangle$</li>
<li>$0 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$</li>
<li>$f(\mathbb{E}X) \leq \mathbb{E}[f(X)]$ where $X$ is a random variable <a href="https://en.wikipedia.org/wiki/Jensen's_inequality">(Jensen's inequality)</a>.</li>
<li>$x = \text{prox}_{\gamma f}(x) + \gamma \text{prox}_{f^*/\gamma}(x/\gamma)$, where $f^*$ is the Fenchel
conjugate and $\text{prox}_{\gamma f}(x)$ is the proximal operator of $\gamma f$. This identity is sometimes
referred to as Moreau's decomposition</li>
</ol>
</p>
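<p>Inequality 2.e can be verified numerically. As an illustration (the choice of function is mine): for $f = \|\cdot\|_1$, the proximal operator of $\gamma f$ is soft-thresholding, and $f^*$ is the indicator of the $\ell_\infty$ unit ball, whose proximal operator is the projection onto that ball (scaling an indicator function by $1/\gamma$ leaves it unchanged).</p>

```python
import numpy as np

def soft_threshold(x, gamma):
    # prox of gamma * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def project_linf_ball(x):
    # prox of the indicator of the l-infinity unit ball, i.e. of f*/gamma
    return np.clip(x, -1.0, 1.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
gamma = 0.7

# Moreau decomposition: x = prox_{gamma f}(x) + gamma * prox_{f*/gamma}(x / gamma)
rhs = soft_threshold(x, gamma) + gamma * project_linf_ball(x / gamma)
assert np.allclose(x, rhs)
```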
<h2>$f$ is both $L$-smooth and convex</h2>
<ol type="a">
<li>$\frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$</li>
<li>$0 \leq f(y) - f(x) - \langle \nabla f(x), y - x\rangle \leq \frac{L}{2}\|x - y\|^2$</li>
<li>$f(x) \leq f(y) + \langle \nabla f(x), x - y\rangle - \frac{1}{2 L}\|\nabla f(x) - \nabla f(y)\|^2$</li>
<li>$f(x) \leq f(y) + \langle \nabla f(z), x - y \rangle + \frac{L}{2}\|x - z\|^2$ (three points descent lemma)</li>
</ol>
</p>
<div class="wrap-collabsible"> <input id="collapsible3" class="toggle" type="checkbox"> <label for="collapsible3" class="lbl-toggle" tabindex="0"><b>Show proof</b></label>
<div class="collapsible-content">
<div class="content-inner">
<div class="proof" id="proof-smooth-cvx">
<ol type="a">
<li>Follows from adding inequality 3.c to the same inequality with the roles of $x$ and $y$ exchanged (3.c is proven independently below).</li>
<li>The left-hand inequality is convexity (2.b, rearranged); the right-hand one is $L$-smoothness (1.b).</li>
<li>
The function $f$ being smooth implies that its Fenchel conjugate $f^\star$ is $\frac{1}{L}$-strongly
convex, so using inequality 4.a with $f^\star$ we get
\begin{equation}
f^\star(\alpha) \leq f^\star(\beta) + \langle \nabla f^\star(\alpha), \alpha - \beta\rangle -
\frac{1}{2L}\|\alpha - \beta\|^2
\end{equation}
Using the <a href="https://en.wikipedia.org/wiki/Convex_conjugate#Fenchel's_inequality">Fenchel
identities</a> $f^\star(\nabla f(x)) = \langle \nabla f(x), x\rangle - f(x)$ and $\nabla f^\star(\nabla
f(x)) = x$ with $\alpha = \nabla f(y)$ and $\beta = \nabla f(x)$ we get
\begin{align}
\langle \nabla f(y), y\rangle - f(y) \leq \langle \nabla f(x), x\rangle - f(x) + \langle y, \nabla f(y) -
\nabla f(x)\rangle - \frac{1}{2L}\|\nabla f(y) - \nabla f(x)\|^2
\end{align}
Rearranging terms gives the desired inequality.<dt-note>There are other ways to prove this inequality that
don't require the heavy machinery of Fenchel conjugates, see for example <a href="https://link.springer.com/book/10.1007/978-1-4419-8853-9">Nesterov's book</a>. Personally, I find Nesterov's proof
rather "magical", as one needs to come up with the right proxy functions, and I prefer the
straightforwardness of the above one.</dt-note>
</li>
</ol>
</div>
</div>
</div>
</div>
<h2>$f$ is $\mu$-strongly convex</h2>
<p>
This is the set of functions $f$ such that $f - \frac{\mu}{2}\|\cdot\|^2$ is convex; for $\mu=0$ it
reduces to the set of convex functions. Here $x^*$ denotes the minimizer of $f$.
</p>
<ol type="a">
<li>$f(x) \leq f(y) + \langle \nabla f(x), x - y \rangle - \frac{\mu}{2}\|x - y\|^2$</li>
<li>$f(x) \leq f(y) + \langle \nabla f(y), x - y\rangle + \frac{1}{2\mu}\|\nabla f(x) - \nabla f(y)\|^2$</li>
<li>$\mu\|x - y\|^2 \leq \langle \nabla f(x) - \nabla f(y), x - y\rangle$</li>
<li>$\frac{\mu}{2}\|x-x^*\|^2\leq f(x) - f(x^*)$</li>
<li>$f(\alpha x + (1 - \alpha)y) \leq \alpha f(x) + (1 - \alpha)f(y) - \alpha(1 - \alpha)\frac{\mu}{2}\|x - y\|^2$
</li>
</ol>
<div class="wrap-collabsible"> <input id="collapsible4" class="toggle" type="checkbox"> <label for="collapsible4" class="lbl-toggle" tabindex="0"><b>Show proof</b></label>
<div class="collapsible-content">
<div class="content-inner">
<div class="proof" id="proof-mu-cvx">
<ol type="a">
<li>Follows from using inequality 2.b with the function $f - \frac{\mu}{2}\|\cdot\|^2$.
</li>
</ol>
</div>
</div>
</div>
</div>
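<p>For a strongly convex quadratic, several of these inequalities can be checked directly. The sketch below (my addition) verifies 4.a, 4.c and 4.d for $f(x) = \frac{1}{2}x^\top Q x - c^\top x$, where $\mu$ is the smallest eigenvalue of $Q$ and the minimizer is $x^* = Q^{-1}c$.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
Q = M @ M.T + 0.5 * np.eye(6)      # positive definite Hessian
c = rng.standard_normal(6)

f = lambda x: 0.5 * x @ Q @ x - c @ x
grad = lambda x: Q @ x - c

mu = np.linalg.eigvalsh(Q).min()   # strong convexity constant
x_star = np.linalg.solve(Q, c)     # minimizer of f

x, y = rng.standard_normal(6), rng.standard_normal(6)

# 4.a: upper bound involving the gradient at x
assert f(x) <= f(y) + grad(x) @ (x - y) - mu / 2 * np.dot(x - y, x - y) + 1e-10
# 4.c: strong monotonicity of the gradient
assert mu * np.dot(x - y, x - y) <= (grad(x) - grad(y)) @ (x - y) + 1e-10
# 4.d: quadratic growth away from the minimizer
assert mu / 2 * np.dot(x - x_star, x - x_star) <= f(x) - f(x_star) + 1e-10
```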
<h2>$f$ is both $L$-smooth and $\mu$-strongly convex.</h2>
<ol type="a">
<li>$\frac{\mu L}{\mu + L}\|x - y\|^2 + \frac{1}{\mu + L}\|\nabla f(x) - \nabla f(y)\|^2 \leq \langle \nabla f(x) -
\nabla f(y), x - y\rangle$</li>
<li>$\mu \preceq \nabla^2 f(x) \preceq L \qquad \text{ (assuming $f$ is twice differentiable)}$</li>
<!-- <li>$f(y) - f(x) - \langle\nabla f(x), y - x \rangle \geq \frac{1}{2(1 - \mu/L)}\left(\frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 + \mu \|x - y\|^2 - 2 \frac{\mu}{L}\langle \nabla f(x) - \nabla f(y), x - y\rangle \right)$</li> -->
<li>$f(x) \leq f(y)+ \langle \nabla f(x), x - y\rangle - \frac{\mu}{2}\|x - y\|^2 - \frac{1}{2 (L - \mu)}\|\nabla f(x) - \nabla f(y) - \mu(x - y)\|^2$</li>
</ol>
<div class="wrap-collabsible"> <input id="collapsible5" class="toggle" type="checkbox"> <label for="collapsible5" class="lbl-toggle" tabindex="0"><b>Show proof</b></label>
<div class="collapsible-content">
<div class="content-inner">
<div class="proof" id="proof-mu-cvx-L-smooth">
<ol type="a">
<li>Follows from applying inequality 3.a to the function $f - \frac{\mu}{2}\|\cdot\|^2$, which is convex and $(L-\mu)$-smooth, and rearranging terms.</li>
<li>The upper bound is inequality 1.c; the lower bound follows from convexity of $f - \frac{\mu}{2}\|\cdot\|^2$, whose Hessian is $\nabla^2 f(x) - \mu \succeq 0$.</li>
<li>
We have that if $f$ is $L$-smooth and $\mu$-strongly convex then $f - \frac{\mu}{2}\|\cdot\|^2$ is $(L-\mu)$-smooth and convex. Using inequality 3.c on the function $f - \frac{\mu}{2}\|\cdot\|^2$ we then have
\begin{align}
f(x) &\leq f(y) - \frac{\mu}{2}\|y\|^2 + \frac{\mu}{2}\|x\|^2 + \langle \nabla f(x) - \mu x, x - y\rangle - \frac{1}{2 (L - \mu)}\|\nabla f(x) - \nabla f(y) - \mu(x - y)\|^2 \\
&= f(y)+ \langle \nabla f(x), x - y\rangle - \frac{\mu}{2}\|x - y\|^2 - \frac{1}{2 (L - \mu)}\|\nabla f(x) - \nabla f(y) - \mu(x - y)\|^2
\end{align}
</li>
</ol>
</div>
</div>
</div>
</div>
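<p>As in the previous sections, inequality 5.a can be sanity-checked on a quadratic (an illustration of mine), with $\mu$ and $L$ the extreme eigenvalues of the Hessian:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
Q = M @ M.T + 0.5 * np.eye(6)      # Hessian, so mu <= eigenvalues of Q <= L
c = rng.standard_normal(6)

grad = lambda x: Q @ x - c
eigs = np.linalg.eigvalsh(Q)
mu, L = eigs.min(), eigs.max()

x, y = rng.standard_normal(6), rng.standard_normal(6)
d = x - y
delta = grad(x) - grad(y)

# inequality 5.a (combined co-coercivity for smooth strongly convex functions)
lhs = mu * L / (mu + L) * d @ d + 1.0 / (mu + L) * delta @ delta
assert lhs <= delta @ d + 1e-10
```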
<h3>Misc</h3>
<p>
Another great source of general inequalities is <a href="https://www.lkozma.net/inequalities_cheat_sheet/ineq.pdf">this cheatsheet</a> by László Kozma.
</p>
<h3>References</h3>
<p>
Most of these inequalities appear in the Book: "Introductory lectures on convex optimization: A basic course" by
Nesterov (2013, Springer Science & Business Media). Another good (and free) resource is the book <a href="http://stanford.edu/~boyd/cvxbook/">"Convex Optimization"</a> by
Stephen Boyd and Lieven Vandenberghe.
</p>
</p>
A fully asynchronous variant of the SAGA algorithm2016-10-12T00:00:00+02:002016-10-12T00:00:00+02:00Fabian Pedregosatag:fa.bianp.net,2016-10-12:/blog/2016/a-fully-asynchronous-variant-of-the-saga-algorithm/<p>My friend <a href="http://www.di.ens.fr/~rleblond/">Rémi Leblond</a> has recently uploaded to ArXiv <a href="https://arxiv.org/abs/1606.04809">our preprint on an asynchronous version of the SAGA optimization algorithm</a>.</p>
<p>The main contribution is to develop a parallel (fully asynchronous, no locks) variant of the <a href="http://papers.nips.cc/paper/5258-saga-a-fast-incremental-gradient-method-with-support-for-non-strongly-convex-composite-objectives.pdf">SAGA algorithm</a>. This is a stochastic variance-reduced method for general optimization, specially adapted for problems …</p><p>My friend <a href="http://www.di.ens.fr/~rleblond/">Rémi Leblond</a> has recently uploaded to ArXiv <a href="https://arxiv.org/abs/1606.04809">our preprint on an asynchronous version of the SAGA optimization algorithm</a>.</p>
<p>The main contribution is to develop a parallel (fully asynchronous, no locks) variant of the <a href="http://papers.nips.cc/paper/5258-saga-a-fast-incremental-gradient-method-with-support-for-non-strongly-convex-composite-objectives.pdf">SAGA algorithm</a>. This is a stochastic variance-reduced method for general optimization, specially adapted for problems that arise frequently in machine learning such as (regularized) least squares and logistic regression. Besides the specification of the algorithm, we also provide a convergence proof and convergence rates. Furthermore, we fix some subtle technical issues present in previous literature (proving things in the asynchronous setting is hard!).</p>
<p>The core of the asynchronous algorithm is similar to <a href="https://arxiv.org/abs/1106.5730">Hogwild!</a>, a popular asynchronous variant of stochastic gradient descent (SGD). The main difference is that instead of using SGD as a building block, we use SAGA. This has many advantages (and poses some challenges): faster (exponential!) rates of convergence and convergence to arbitrary precision with a fixed step size (hence a clear stopping criterion), to name a few.</p>
<p>The speedups obtained versus the sequential version are quite impressive. For example, we have observed to commonly obtain 5x-7x speedups using 10 cores:</p>
<p style="text-align: center">
<img src="/images/2016/figure_asaga.png" width="600px"/>
</p>
<p>Update April 2017: this work has been presented at <a href="http://proceedings.mlr.press/v54/leblond17a.html">AISTATS 2017</a>.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Had a great time presenting our work at <a href="https://twitter.com/hashtag/AISTATS2017?src=hash">#AISTATS2017</a>, can't wait for the next edition! <a href="https://t.co/W58K5IVRio">pic.twitter.com/W58K5IVRio</a></p>— Fabian Pedregosa (@fpedregosa) <a href="https://twitter.com/fpedregosa/status/856565126744334336">April 24, 2017</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Update Dec 2017: a follow-up of this work generalizes the algorithm to the more general class of composite objective functions. See our paper <a href="http://papers.nips.cc/paper/6611-breaking-the-nonsmooth-barrier-a-scalable-parallel-method-for-composite-optimization.pdf">"Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization"</a></p>Hyperparameter optimization with approximate gradient2016-05-25T00:00:00+02:002016-05-25T00:00:00+02:00Fabian Pedregosatag:fa.bianp.net,2016-05-25:/blog/2016/hyperparameter-optimization-with-approximate-gradient/
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
Macros: {
RR: "{\\mathbb{R}}",
argmin: "{\\mathop{\\mathrm{arg\\,min}}}",
bold: ["{\\bf #1}",1]
}
},
"HTML-CSS": { availableFonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p><b>TL;DR:</b> I describe a <a href="http://arxiv.org/abs/1602.02355">method for hyperparameter optimization</a> by gradient descent.</p>
<p>Most machine …</p>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
Macros: {
RR: "{\\mathbb{R}}",
argmin: "{\\mathop{\\mathrm{arg\\,min}}}",
bold: ["{\\bf #1}",1]
}
},
"HTML-CSS": { availableFonts: ["TeX"] }
});
</script>
<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p><b>TL;DR:</b> I describe a <a href="http://arxiv.org/abs/1602.02355">method for hyperparameter optimization</a> by gradient descent.</p>
<p>Most machine learning models rely on at least one hyperparameter to control for model complexity. For example, logistic regression commonly relies on a regularization parameter that controls the amount of $\ell_2$ regularization. Similarly, kernel methods also have hyperparameters that control for properties of the kernel, such as the "width" parameter in the RBF kernel. The fundamental distinction between model parameters and hyperparameters is that, while model parameters are estimated by minimizing a goodness of fit with the training data, hyperparameters need to be estimated by other means (such as a cross-validation loss), as otherwise models with excessive would be selected, a phenomenon known as <i>overfitting</i>.
<span class="marginnote">
<img style="display: block; margin: 0 auto;width: 250px; margin: 10px" src="http://fa.bianp.net/images/2016/approx_grad.png" />
We can use an approximate gradient to optimize a cross-validation loss with respect to hyperparameters. A decreasing bound between the true gradient and the approximate gradient ensures that the method converges towards a local minima.
</span></p>
<p>Fitting hyperparameters is essential to obtaining models with good accuracy, yet computationally challenging. The most popular existing methods for fitting hyperparameters are based on either exhaustively exploring the whole hyperparameter space (grid search and random search) or on Bayesian optimization techniques that use previous function evaluations to guide the optimization procedure. <b>The starting point of this work was a simple question: why are the procedures to estimate parameters and hyperparameters so different?</b> Is it possible to use known and reliable methods such as gradient descent to fit not only parameters, but also hyperparameters?</p>
<p>Interestingly, I found out that this question had been answered a long time ago. Already in the 90s, Larsen et al. devised a method (described <a href=" http://dx.doi.org/10.1109/NNSP.1996.548336">here</a> and <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.9956&rep=rep1&type=pdf">here</a>) using gradient-descent to estimate the optimal value of $\ell_2$ regularization for neural networks. Shortly after, <a href="http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015187">Y. Bengio</a> also published a paper on this topic. Recently, there has been a renewed interest in gradient-based methods (see for example <a href="http://arxiv.org/abs/1502.03492">this paper</a> by Maclaurin or a slightly <a href="https://justindomke.wordpress.com/2014/02/03/truncated-bi-level-optimization/">earlier work</a> by Justin Domke, and references therein).</p>
<p>One of the drawbacks of gradient-based optimization of hyperparameters, is that these depend on quantities that are costly to compute such as the exact value of the model parameters and the inverse of a Hessian matrix. The aim of this work is to relax some of these assumptions and provide a method that works when the quantities involved (such as model parameters) are known only approximately. In practice, what this means is that hyperparameters can be updated before model parameters have fully converged, which results in big computational gains. For more details and experiments, please take a look at <a href="http://arxiv.org/abs/1602.02355">the paper</a>. </p>
<p>This paper was presented at the International Conference on Machine Learning (<a href="http://icml.cc/2016/">ICML 2016</a>). Code is now available on <a href="https://github.com/fabianp/hoag">github</a>, and these are the slides I used for the occasion:</p>
<p style="text-align: center">
<iframe src="//www.slideshare.net/slideshow/embed_code/key/D9u7kfb23OTE0V" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe> <div style="margin-bottom:5px"> <strong> <a href="//www.slideshare.net/FabianPedregosa/hyperparameter-optimization-with-approximate-grradient" title="Hyperparameter optimization with approximate gradient" target="_blank">Hyperparameter optimization with approximate gradient</a> </strong> from <strong><a href="//www.slideshare.net/FabianPedregosa" target="_blank">Fabian Pedregosa</a></strong> </div>
</p>
<h3>Reviews</h3>
<p>
The original ICML reviews for this paper can be seen <a href="http://icml.cc/2016/reviews/337.txt">here</a> and the rebuttal <a href="http://icml.cc/2016/rebuttals/337.txt">here</a>. These were high-quality reviews that raised some legitimate concerns about the first version of the manuscript. In fact, 2 out of 3 reviewers gave a "weak reject" in this first phase. Luckily, these concerns could be addressed in the rebuttal (the final manuscript was updated accordingly), the two reviewers who gave a "weak reject" changed their rating to "weak accept", and the paper was finally accepted.
</p>
<h3>(Relatively) Frequently Asked Questions</h3>
<ul>
<li>The outer loss function does not depend directly on the regularization parameter $\lambda$. Why is it there?</li>
</ul>
<p>When the outer loss is a cross-validation loss this is indeed the case, but other criteria might depend on this parameter, such as the SURE criterion (see e.g. <a href="https://arxiv.org/abs/1405.1164">Deledalle et al. 2014</a>).</p>
<h3>Citing</h3>
<p>Please cite this work if the paper or its associated code are relevant for you. You can use the following bibtex:
<code style="font-size: 90%"><pre>
@inproceedings{PedregosaHyperparameter16,
author = {Fabian Pedregosa},
title = {Hyperparameter optimization with approximate gradient},
booktitle = {Proceedings of the 33rd International Conference on Machine Learning ({ICML})},
year = {2016},
url = {http://proceedings.mlr.press/v48/pedregosa16.html},
}
</pre></code></p>
<h3>Erratum</h3>
<p>An early version of the paper contained the following typo: the first equation in Theorem 2 should read $\sum_{i=1}^\infty \varepsilon_i \lt \infty$ instead of $\sum_{i=1}^\infty \varepsilon_i \leq \infty$ (note the $<$ in the first equation versus $\leq$ in the second one). This typo has been corrected in both ArXiv version and <a href="http://proceedings.mlr.press/v48/pedregosa16.pdf">the proceedings</a>.</p>
Lightning v0.12016-03-25T00:00:00+01:002016-03-25T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2016-03-25:/blog/2016/lightning-v01/<p>Announcing the first public release of <a href="http://contrib.scikit-learn.org/lightning/">lightning</a>, a library for large-scale linear classification, regression and ranking in Python. The library was started a couple of years ago by <a href="http://mblondel.org">Mathieu Blondel</a>, who also contributed the vast majority of the source code. I recently joined its development and decided it was about time for …</p><p>Announcing the first public release of <a href="http://contrib.scikit-learn.org/lightning/">lightning</a>, a library for large-scale linear classification, regression and ranking in Python. The library was started a couple of years ago by <a href="http://mblondel.org">Mathieu Blondel</a>, who also contributed the vast majority of the source code. I recently joined its development and decided it was about time for a v0.1!</p>
<p>Prebuilt conda packages are available for all operating systems (thank goodness for AppVeyor). More information is available on <a href="http://contrib.scikit-learn.org/lightning/">lightning's website</a>.</p>scikit-learn-contrib, an umbrella for scikit-learn related projects.2016-03-06T00:00:00+01:002016-03-06T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2016-03-06:/blog/2016/scikit-learn-contrib-an-umbrella-for-scikit-learn-related-projects/<p>Together with other scikit-learn developers we've created an umbrella organization for scikit-learn-related projects named <a href="https://github.com/scikit-learn-contrib">scikit-learn-contrib</a>. The idea is for this organization to host projects that are deemed too specific or too experimental to be included in the scikit-learn codebase but still offer an API which is compatible with scikit-learn and …</p><p>Together with other scikit-learn developers we've created an umbrella organization for scikit-learn-related projects named <a href="https://github.com/scikit-learn-contrib">scikit-learn-contrib</a>. The idea is for this organization to host projects that are deemed too specific or too experimental to be included in the scikit-learn codebase but still offer an API which is compatible with scikit-learn and would like to benefit from the visibility of being labeled as scikit-learn-compatible.</p>
<p style="text-align: center; float: left; margin: 30px"><img src="https://avatars3.githubusercontent.com/u/17349883?v=3" height="200px"/></p>
<p>We've set two requirements for being under this umbrella right now (this might evolve in the future). The first requirement is to have a scikit-learn compatible API, i.e., to follow <a href="http://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects">the guide</a> on the scikit-learn documentation so that objects can be used by scikit-learn meta-estimators (such as <a href="http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html">GridSearchCV</a>). The second condition is that projects should be actively maintained and have a high-quality codebase. Judging the quality of a codebase is difficult and subjective, but we agreed that at the bare minimum, the source code should be tested using continuous integration tools such as <a href="https://travis-ci.org/">travis</a> and reach a good test coverage (above 80%). More information is now available on the <a href="https://github.com/scikit-learn-contrib/scikit-learn-contrib/blob/master/workflow.md">scikit-learn-contrib repository</a>.</p>
<p>The first project to be hosted by this organization is <a href="http://contrib.scikit-learn.org/lightning/">lightning</a>, but we hope that others will follow. If you would like to submit a new project, open an issue at the <a href="https://github.com/scikit-learn-contrib/scikit-learn-contrib">main project</a> and we will look into it. There is also a <a href="https://github.com/scikit-learn-contrib/project-template">project template</a> for new and old projects.</p>SAGA algorithm in the lightning library2016-02-22T00:00:00+01:002016-02-22T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2016-02-22:/blog/2016/saga-algorithm-in-the-lightning-library/<p>Recently I've implemented, together with <a href="http://arachez.com/">Arnaud Rachez</a>, the SAGA[<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>] algorithm in the <a href="http://contrib.scikit-learn.org/lightning/">lightning</a> machine learning library (which by the way, has been recently moved to the new <a href="https://github.com/scikit-learn-contrib">scikit-learn-contrib</a> project). The lightning library uses the same API as scikit-learn but is particularly adapted to online learning. As for the SAGA …</p><p>Recently I've implemented, together with <a href="http://arachez.com/">Arnaud Rachez</a>, the SAGA[<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>] algorithm in the <a href="http://contrib.scikit-learn.org/lightning/">lightning</a> machine learning library (which by the way, has been recently moved to the new <a href="https://github.com/scikit-learn-contrib">scikit-learn-contrib</a> project). The lightning library uses the same API as scikit-learn but is particularly adapted to online learning. 
As for the SAGA algorithm, its performance is similar to that of other variance-reduced stochastic algorithms such as SAG[<sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup>] or SVRG[<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup>], but it has the advantage over SAG[<sup id="fnref2:3"><a class="footnote-ref" href="#fn:3">3</a></sup>] that it allows non-smooth penalty terms (such as $\ell_1$ regularization). It is implemented in lightning as <a href="http://scikit-learn-contrib.github.io/lightning/generated/lightning.classification.SAGAClassifier.html">SAGAClassifier</a> and <a href="http://scikit-learn-contrib.github.io/lightning/generated/lightning.regression.SAGARegressor.html">SAGARegressor</a>.</p>
<p>We have taken care to make this implementation as efficient as possible. As for most stochastic gradient algorithms, a naive implementation takes 3 lines of code and is straightforward to implement. However, there are many tricks that are time-consuming and error-prone to implement but make a huge difference in efficiency.</p>
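<p>To make the "naive implementation" concrete, here is a minimal SAGA loop for a least-squares finite sum. This is only a sketch of the basic update: it stores one full gradient vector per sample (for generalized linear models one would store a scalar per sample instead), and it omits the sparse-friendly lazy updates and the $\ell_1$ proximal step that the lightning implementation provides.</p>

```python
import numpy as np

def saga_least_squares(X, y, step, n_iter=10000, seed=0):
    """Naive SAGA on f(w) = (1/n) sum_i 0.5 * (x_i . w - y_i)^2."""
    n, d = X.shape
    w = np.zeros(d)
    grads = X * (X @ w - y)[:, None]   # table of per-sample gradients
    g_avg = grads.mean(axis=0)         # running mean of the table
    rng = np.random.RandomState(seed)
    for _ in range(n_iter):
        j = rng.randint(n)
        g_new = X[j] * (X[j] @ w - y[j])
        # variance-reduced gradient estimate, then the SGD-like step
        w -= step * (g_new - grads[j] + g_avg)
        # keep the table and its mean up to date
        g_avg += (g_new - grads[j]) / n
        grads[j] = g_new
    return w

rng = np.random.RandomState(42)
X = rng.randn(100, 5)
w_true = rng.randn(5)
y = X @ w_true                       # noiseless, so the solution is w_true
# step size 1/(3L), with L the largest per-sample Lipschitz constant
L = np.max(np.sum(X ** 2, axis=1))
w_hat = saga_least_squares(X, y, step=1 / (3 * L))
```

<p>Note that the step size $1/(3L)$ used here is the same choice as in the benchmark below.</p>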
<p>A small example, more as a sanity check than to claim anything. The following plot shows the suboptimality as a function of time for three similar methods: SAG, SAGA and SVRG. The dataset used is the RCV1 dataset (test set, obtained from the <a href="https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html">libsvm webpage</a>), consisting of 677,399 samples and 47,236 features. Interestingly, all methods can solve this rather large-scale problem within a few seconds. Among them, SAG and SAGA have very similar performance, and SVRG seems to be somewhat faster.</p>
<p style="text-align: center;">
<img src="/images/2016/rcv1_comparison.png" width="500px"/>
</p>
<p>A note about the benchmarks: it is difficult to compare stochastic gradient methods fairly because in the end it usually boils down to how you choose the step size. In this plot I set the step size of all methods to 1/(3L), where L is the Lipschitz constant of the objective function, as I think this is a popular choice. I would have preferred 1/L, but SVRG was not converging with this step size. The code for the benchmarks can be found <a href="https://gist.github.com/fabianp/fdfb77c5d1835cc1fb17">here</a>.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>A. Defazio, F. Bach & S. Lacoste-Julien. "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives" (2014). <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Rie Johnson and Tong Zhang. "Accelerating stochastic gradient descent using predictive variance reduction." Advances in Neural Information Processing Systems. 2013. <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>Mark Schmidt, Nicolas Le Roux, and Francis Bach. "Minimizing finite sums with the stochastic average gradient." arXiv preprint arXiv:1309.2388 (2013). <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a><a class="footnote-backref" href="#fnref2:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
</ol>
</div>On the consistency of ordinal regression methods2015-10-09T00:00:00+02:002015-10-09T00:00:00+02:00Fabian Pedregosatag:fa.bianp.net,2015-10-09:/blog/2015/on-the-consistency-of-ordinal-regression-methods/<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>My latest work (with <a href="http://www.di.ens.fr/~fbach/">Francis Bach</a> and <a href="http://alexandre.gramfort.net/">Alexandre Gramfort</a>) is on the consistency of ordinal regression methods. It has the wildly imaginative …</p><script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>My latest work (with <a href="http://www.di.ens.fr/~fbach/">Francis Bach</a> and <a href="http://alexandre.gramfort.net/">Alexandre Gramfort</a>) is on the consistency of ordinal regression methods. It has the wildly imaginative title of "On the Consistency of Ordinal Regression Methods" and is currently under review, but you can read the draft of it <a href="http://arxiv.org/abs/1408.2327">on ArXiv</a>. If you have any thoughts about it, please leave me a comment!</p>
<p>** Update July 2017: this paper was published on the Journal of Machine Learning Research. The published version can be found <a href="http://jmlr.org/papers/v18/15-495.html">here</a> **</p>
<h3>Ordinal what?</h3>
<p>The problem of ordinal regression is an old one in supervised learning. Its roots can be traced back to the works of McCullagh[<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>] in the 80s. It is a supervised learning problem that shares properties with both multiclass classification and regression, yet is fundamentally different from both. It can be seen as the problem of predicting a target variable from labeled observations, where the target takes values in a discrete and ordered set of labels. As in the multiclass classification setting, the target variable is discrete, and as in the regression setting (but unlike the multiclass setting) there is a meaningful order between the classes.</p>
<p>The most popular example of ordinal regression arises when the target variable is a human-generated rating. For example, for a movie recommendation system, the target variable can have the possible values “do-not-bother” ≺ “only-if-you-must” ≺ “good” ≺ “very-good” ≺ “run-to-see”. Using multiclass classification to predict this target would yield a suboptimal classifier since it ignores the fact that there is a natural ordering between the labels. On the other hand, a regression algorithm assumes that the target variable is continuous, while here it is clearly discrete. Ordinal regression would be the ideal model for this target variable.</p>
<h3>Fisher consistency</h3>
<p>The notion of Fisher consistency is also an old one in statistics, and goes back to the work of Fisher at the beginning of the 20th century. The rigorous definition is stated in the paper, so here I'll just give an intuition.</p>
<p>In supervised learning, we observe random samples (in the form of pairs (target, sample), usually denoted $(y_i, X_i)$) from a population (let's call it P) and build a model that predicts the target when it sees a new sample. Fisher consistency can be seen as a sanity check on the learning model: it states that if, instead of seeing a random sample, we had access to the full population P (which in real life never happens), then our classifier would achieve the best possible accuracy (such a classifier is usually called a Bayes rule or Bayes predictor).</p>
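<p>In symbols, one common way of stating this intuition is the following (my notation; the paper gives the precise definition for the ordinal setting, where the prediction is obtained from a decision function through the thresholds):</p>

```latex
% A surrogate loss \ell is Fisher consistent for a task loss L
% (e.g., the absolute error between the true and predicted label)
% when every minimizer of the surrogate risk over all measurable
% functions is also a minimizer of the task risk (a Bayes predictor):
f^\star \in \operatorname*{arg\,min}_{f}\; \mathbb{E}_{(X,Y)\sim P}\big[\ell(Y, f(X))\big]
\quad\Longrightarrow\quad
f^\star \in \operatorname*{arg\,min}_{f}\; \mathbb{E}_{(X,Y)\sim P}\big[L(Y, f(X))\big]
```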
<p>Having Fisher consistency is an important property that "allows us to design good loss functions with desirable properties"[<sup id="fnref:7"><a class="footnote-ref" href="#fn:7">7</a></sup>]. Because of this, in the last decade the Fisher consistency of most used supervised learning methods has been investigated. It has been shown (see e.g.[<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup>]) that most common methods for binary classification are consistent. For the multiclass case and ranking, the situation is more interesting, with some methods that are known to be inconsistent, such as one-vs-all SVM in multiclass classification and RankSVM.</p>
<p>The study of Fisher consistency for ordinal regression methods has been done for the first time (to the best of my knowledge) <a href="http://arxiv.org/abs/1408.2327">here</a> and proves that despite the negative results of multiclass classification[<sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup>][<sup id="fnref:4"><a class="footnote-ref" href="#fn:4">4</a></sup>] and ranking[<sup id="fnref:5"><a class="footnote-ref" href="#fn:5">5</a></sup>][<sup id="fnref:6"><a class="footnote-ref" href="#fn:6">6</a></sup>], common ordinal regression methods are Fisher consistent. This brings ordinal regression closer to binary classification than to multiclass classification in this respect. And in fact, some results in the paper can be seen as generalization of known results for binary classification.</p>
<h3>Highlights</h3>
<p>In the paper we study the Fisher consistency of some popular ordinal regression methods. The methods that we analyze are the following (see Table 1 in the paper for a definition): all threshold, cumulative link, immediate threshold and least absolute deviation. We present the following results:</p>
<ul>
<li>We characterize the consistency of the all threshold and immediate threshold methods in terms of the derivative at zero of an auxiliary function.</li>
<li>We provide an excess risk bound for the all threshold loss.</li>
<li>We prove consistency of the least absolute deviation loss. This was already done for the case of three classes by Ramaswamy et al.[<sup id="fnref:8"><a class="footnote-ref" href="#fn:8">8</a></sup>]; here we extend the proof to an arbitrary number of classes.</li>
<li>We prove consistency of the cumulative link model (a model that includes the venerable proportional odds model of McCullagh<sup id="fnref2:1"><a class="footnote-ref" href="#fn:1">1</a></sup>).</li>
<li>The consistency analysis suggests a novel loss function when optimizing for a least squares metric. We test this novel model on different datasets and find that it performs very competitively.</li>
</ul>
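<p>For reference, the all threshold loss mentioned above has (up to sign conventions, which vary across papers — see Table 1 of the paper for the exact definitions used) the form</p>

```latex
% K classes, decision value \alpha = f(x), ordered thresholds
% \theta_1 \le \dots \le \theta_{K-1}, and \varphi a convex surrogate
% of the 0-1 loss such as the hinge or logistic loss:
\ell_{\mathrm{AT}}(y, \alpha) \;=\; \sum_{k=1}^{y-1} \varphi(\alpha - \theta_k)
\;+\; \sum_{k=y}^{K-1} \varphi(\theta_k - \alpha)
```

<p>that is, every threshold contributes a penalty whenever the decision value falls on the wrong side of it.</p>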
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>McCullagh, Peter. "Regression models for ordinal data." Journal of the royal statistical society. Series B (Methodological) (1980). <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a><a class="footnote-backref" href="#fnref2:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Bartlett, Peter L., Michael I. Jordan, and Jon D. McAuliffe. "Convexity, classification, and risk bounds." Journal of the American Statistical Association (2006). <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>Tewari, Ambuj, and Peter L. Bartlett. "On the consistency of multiclass classification methods." The Journal of Machine Learning Research 8 (2007). <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>Zhang, Tong. "Statistical analysis of some multi-category large margin classification methods." The Journal of Machine Learning Research 5 (2004). <a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p>Duchi, John C., Lester W. Mackey, and Michael I. Jordan. "On the consistency of ranking algorithms." Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010. <a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:6">
<p>Calauzenes, Clément, Nicolas Usunier, and Patrick Gallinari. "On the (non-) existence of convex, calibrated surrogate losses for ranking." Neural Information Processing Systems. 2012. <a class="footnote-backref" href="#fnref:6" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:7">
<p>Zhang, Tong. "Statistical behavior and consistency of classification methods based on convex risk minimization." Annals of Statistics (2004) <a class="footnote-backref" href="#fnref:7" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
<li id="fn:8">
<p>Ramaswamy, Harish G., and Shivani Agarwal. "Classification calibration dimension for general multiclass losses." Advances in Neural Information Processing Systems. 2012. <a class="footnote-backref" href="#fnref:8" title="Jump back to footnote 8 in the text">↩</a></p>
</li>
</ol>
</div>Holdout cross-validation generator2015-08-20T00:00:00+02:002015-08-20T00:00:00+02:00Fabian Pedregosatag:fa.bianp.net,2015-08-20:/blog/2015/holdout-cross-validation-generator/<p><a href="http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators">Cross-validation iterators</a> in scikit-learn are simply generator objects, that is, Python objects that implement the <code>__iter__</code> method and that for each call to this method return (or more precisely, <code>yield</code>) the indices or a boolean mask for the train and test set. Hence, implementing new cross-validation iterators that behave as …</p><p><a href="http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators">Cross-validation iterators</a> in scikit-learn are simply generator objects, that is, Python objects that implement the <code>__iter__</code> method and that for each call to this method return (or more precisely, <code>yield</code>) the indices or a boolean mask for the train and test set. Hence, implementing new cross-validation iterators that behave as the ones in scikit-learn is easy with this in mind. Here goes a small code snippet that implements a holdout cross-validator generator following the scikit-learn API. </p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">sklearn.utils</span> <span class="kn">import</span> <span class="n">check_random_state</span>
<span class="k">class</span> <span class="nc">HoldOut</span><span class="p">:</span>
<span class="w"> </span><span class="sd">"""</span>
<span class="sd"> Hold-out cross-validator generator. In the hold-out, the</span>
<span class="sd"> data is split only once into a train set and a test set.</span>
<span class="sd"> Unlike in other cross-validation schemes, the hold-out</span>
<span class="sd"> consists of only one iteration.</span>
<span class="sd"> Parameters</span>
<span class="sd"> ----------</span>
<span class="sd"> n : total number of samples</span>
<span class="sd"> test_size : 0 < float < 1</span>
<span class="sd"> Fraction of samples to use as test set. Must be a</span>
<span class="sd"> number between 0 and 1.</span>
<span class="sd"> random_state : int</span>
<span class="sd"> Seed for the random number generator.</span>
<span class="sd"> """</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">n</span> <span class="o">=</span> <span class="n">n</span>
<span class="bp">self</span><span class="o">.</span><span class="n">test_size</span> <span class="o">=</span> <span class="n">test_size</span>
<span class="bp">self</span><span class="o">.</span><span class="n">random_state</span> <span class="o">=</span> <span class="n">random_state</span>
<span class="k">def</span> <span class="fm">__iter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">n_test</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">ceil</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">test_size</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">n</span><span class="p">))</span>
<span class="n">n_train</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">n</span> <span class="o">-</span> <span class="n">n_test</span>
<span class="n">rng</span> <span class="o">=</span> <span class="n">check_random_state</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">random_state</span><span class="p">)</span>
<span class="n">permutation</span> <span class="o">=</span> <span class="n">rng</span><span class="o">.</span><span class="n">permutation</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">n</span><span class="p">)</span>
<span class="n">ind_test</span> <span class="o">=</span> <span class="n">permutation</span><span class="p">[:</span><span class="n">n_test</span><span class="p">]</span>
<span class="n">ind_train</span> <span class="o">=</span> <span class="n">permutation</span><span class="p">[</span><span class="n">n_test</span><span class="p">:</span><span class="n">n_test</span> <span class="o">+</span> <span class="n">n_train</span><span class="p">]</span>
<span class="k">yield</span> <span class="n">ind_train</span><span class="p">,</span> <span class="n">ind_test</span>
</code></pre></div>
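<p>A quick usage sketch of the generator above. To keep the snippet self-contained, the class is re-stated in abridged form, with <code>np.random.RandomState</code> standing in for scikit-learn's <code>check_random_state</code>:</p>

```python
import numpy as np

class HoldOut:
    """Abridged version of the class above; np.random.RandomState
    stands in for scikit-learn's check_random_state."""
    def __init__(self, n, test_size=0.2, random_state=0):
        self.n, self.test_size, self.random_state = n, test_size, random_state

    def __iter__(self):
        n_test = int(np.ceil(self.test_size * self.n))
        rng = np.random.RandomState(self.random_state)
        permutation = rng.permutation(self.n)
        # a single (train, test) split of the indices 0..n-1
        yield permutation[n_test:], permutation[:n_test]

# the generator yields exactly one (train, test) pair of index arrays
train_idx, test_idx = next(iter(HoldOut(10, test_size=0.3)))
print(len(train_idx), len(test_idx))  # -> 7 3
```

<p>An instance of this class can then be passed wherever scikit-learn accepts an iterable of (train, test) index splits, for example as the <code>cv</code> argument of <code>cross_val_score</code>.</p>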
<p>Contrary to other cross-validation schemes, holdout relies on a single split of the data. It is well known that in practice holdout performs much worse than KFold or LeaveOneOut schemes. However, holdout has the advantage that its theoretical properties are easier to derive. For examples of this, see e.g. Section 8.7 of <a href="http://archive.numdam.org/ARCHIVE/PS/PS_2005__9_/PS_2005__9__323_0/PS_2005__9__323_0.pdf">Theory of classification: a survey of some recent advances</a> and the very recent <a href="https://www.sciencemag.org/content/349/6248/636.short">The reusable holdout</a>.</p>
<p>TL;DR I created a gallery for IPython/Jupyter notebooks. <a href="http://nb.bianp.net">Check it out :-)</a></p>
<div style="text-align: center">
<img alt="Notebook gallery" width="600px" src="http://fa.bianp.net/uploads/2015/screenshot_nbgallery.png" />
</div>
<p>A couple of months ago I put online …</p><p style="font-weight: bold; color:red">Due to lack of time and interest, I'm no longer maintaining this project. Feel free to grab the sources from <a href="https://github.com/fabianp/nbgallery">https://github.com/fabianp/nbgallery</a> and fork the project. </p>
<p>TL;DR I created a gallery for IPython/Jupyter notebooks. <a href="http://nb.bianp.net">Check it out :-)</a></p>
<div style="text-align: center">
<img alt="Notebook gallery" width="600px" src="http://fa.bianp.net/uploads/2015/screenshot_nbgallery.png" />
</div>
<p>A couple of months ago I put online a website that displays a collection of IPython/Jupyter notebooks. It is a website that collects user-submitted and publicly available notebooks and displays them with a nice screenshot. The great thing about this website compared to <a href="https://wakari.io/gallery">other</a> <a href="https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks">similar</a> efforts is that this one updates and categorizes the notebooks (by date and views) automatically. You can even search this database, which already contains more than 400 notebooks!</p>
<h2>Vision</h2>
<p>I would like to make a website where it is possible to</p>
<ol>
<li>
<p>Find the notebook you are looking for. There is precious information (examples, documentation, tutorials, etc.) contained within IPython/Jupyter notebooks. It should be easy to find this information.</p>
</li>
<li>
<p>Discover new notebooks. I would like to see new notebooks as they are submitted in order to discover new notebooks, possibly about new technologies that I did not know about.</p>
</li>
</ol>
<h2>How it works</h2>
<p>Under the hood there's a django app, for which the source code lives <a href="https://github.com/fabianp/nbgallery">here</a> (don't hesitate to use the issues feature in github to suggest features or to report bugs). To propose new notebooks there's a tab where anyone can leave the URL of a notebook. If all goes right, the django app will take a screenshot and incorporate it into the database. The only non-trivial part of this process is to take the screenshot, for which I use a <a href="https://github.com/fabianp/nbgallery/blob/master/web/templates/screenshot.js">bit of javascript</a> around <a href="http://phantomjs.org/">phantomjs</a>.</p>
<h2>Future plans</h2>
<p>What you see is just the beginning! As time permits, I would like to implement user authentication so that registered users can bookmark their favorite notebooks, up- and down-vote notebooks, etc. Categorization of notebooks (e.g. into bins such as Math, Physics, Machine Learning, R, Python, Julia, etc.) is also high on my list. Leave me a comment if you would like to see some specific feature!</p>
<p><img width="600px" src="http://pydataparis.joinux.org/static/images/PyDataLogoBig-Paris2015.png" /></p>
<p>The organizers did a great job putting it together, and the event started with a full room for <a href="http://gael-varoquaux.info/">Gael's</a> keynote</p>
<div style="text-align: center">
<img width="600px" alt="Gael's keynote" src="https://pbs.twimg.com/media/CBplCb_WIAEzytd.jpg" /></div>
<p>My take-away message from the talks is …</p><p>Last Friday was <a href="http://pydataparis.joinux.org/schedule.html">PyData Paris</a>, in words of the organizers, ''a gathering of users and developers of data analysis tools in Python''. </p>
<p><img width="600px" src="http://pydataparis.joinux.org/static/images/PyDataLogoBig-Paris2015.png" /></p>
<p>The organizers did a great job putting it together, and the event started with a full room for <a href="http://gael-varoquaux.info/">Gael's</a> keynote</p>
<div style="text-align: center">
<img width="600px" alt="Gael's keynote" src="https://pbs.twimg.com/media/CBplCb_WIAEzytd.jpg" /></div>
<p>My take-away message from the talks is that Python has grown in 5 years from a language marginally used in some research environments into one of the main languages for data science, used both in research labs and in industrial environments.</p>
<p>My personal highlights were (note that there were two parallel tracks)</p>
<ul>
<li>
<p><a href="http://ianozsvald.com/">Ian Ozsvald's</a> talk on <a href="http://ianozsvald.com/2015/04/03/pydataparis-2015-and-cleaning-confused-collections-of-characters/">Cleaning Confused Collections of Characters</a>. Ian gave a very practical talk, full of real world examples. The slides have already been uploaded on <a href="http://ianozsvald.com/2015/04/03/pydataparis-2015-and-cleaning-confused-collections-of-characters/">his website</a>. Many tips and many pointers to libraries. In particular, I discovered <a href="http://ftfy.readthedocs.org">fixes text for you</a>.</p>
</li>
<li>
<p><a href="http://cazencott.info/">Chloe-Agathe</a> gave a short talk on <a href="http://dreamchallenges.org/">DREAM challenges</a>. In her talk she mentioned <a href="http://sheffieldml.github.io/GPy/">GPy</a>. One year ago, I visited <a href="http://inverseprobability.com/">Neil Lawrence</a> at his lab in Sheffield and at that point they were in the process of migrating their Matlab codebase into Python (the GPy project). I'm very glad to see that the project is succeeding and being used by other research institutions.</p>
</li>
<li>
<p><a href="http://serge.liyun.free.fr/serge/">Serge Guelton</a> and Pierrick Brunet presented “<a href="http://pythonhosted.org/pythran/">Pythran</a>: Static Compilation of Parallel Scientific Kernels”. From their own documentation: “Pythran is a python to c++ compiler for a subset of the python language. It takes a python module annotated with a few interface description and turns it into a native python module with the same interface, but (hopefully) faster”. The project seems promising, although I do not have enough experience with it to judge the quality of their implementation.</p>
</li>
<li>
<p>Antoine Pitrou presented: “<a href="http://numba.pydata.org/">Numba</a>, a JIT compiler for fast numerical code”. I must say that I'm an avid user of Numba so of course I was looking forward to this talk. One thing I didn't know is that support for CUDA is being implemented into Numba via the <code>@cuda.jit</code> decorator. From <a href="http://docs.continuum.io/numbapro/">their website</a> it looks like this is only available in the Numba Pro version (not free).</p>
</li>
<li>
<p>Kirill Smelkov presented <a href="https://pypi.python.org/pypi/wendelin.core">wendelin.core</a>, an approach to perform out-of-core computations with numpy. Slides can be found <a href="http://www.wendelin.io/NXD-Wendelin.Core.Non.Secret?format=pdf">here</a>.</p>
</li>
<li>
<p>Finally, <a href="https://twitter.com/francescalted">Francesc Alted</a> gave the final keynote on “New Trends In Storing And Analyzing Large Data Silos With Python”. Among the projects he mentioned, I found particularly interesting <a href="http://bcolz.blosc.org/">bcolz</a>, his current main project, and <a href="https://github.com/aterrel/dynd-python">DyND</a>, a Python wrapper around a multi-dimensional array library.</p>
</li>
</ul>Data-driven hemodynamic response function estimation2014-12-05T00:00:00+01:002014-12-05T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2014-12-05:/blog/2014/data-driven-hemodynamic-response-function-estimation/<p>My <a href="http://www.sciencedirect.com/science/article/pii/S1053811914008027">latest research paper</a>[<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>] deals with the estimation of the hemodynamic response function (HRF) from fMRI data. </p>
<div style="text-align: center">
<img width="600px" src="/images/2014/graphical_abstract.jpg" />
</div>
<p>This is an important topic since the knowledge of a hemodynamic response function is what makes it possible to extract the brain activation maps that are used in most of the impressive …</p><p>My <a href="http://www.sciencedirect.com/science/article/pii/S1053811914008027">latest research paper</a>[<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>] deals with the estimation of the hemodynamic response function (HRF) from fMRI data. </p>
<div style="text-align: center">
<img width="600px" src="/images/2014/graphical_abstract.jpg" />
</div>
<p>This is an important topic since the knowledge of a hemodynamic response function is what makes it possible to extract the brain activation maps that are used in most of the impressive applications of machine learning to fMRI, such as (but not limited to) the reconstruction of visual images from brain activity [<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup>] [<sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup>] or the decoding of numbers [<sup id="fnref:4"><a class="footnote-ref" href="#fn:4">4</a></sup>].</p>
<p>Besides the more traditional paper that describes the method, I've put online the <a href="https://pypi.python.org/pypi/hrf_estimation">code I used for the experiments</a>. The code at this stage is far from perfect but it should help in reproducing the results or improving the method. I've also put online an <a href="http://nbviewer.ipython.org/github/fabianp/hrf_estimation/blob/master/examples/hrf_estimation%20example.ipynb">ipython notebook</a> with the analysis of a small piece of data. I'm obviously glad to receive feedback/bug reports/patches for this code.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>Pedregosa, Fabian, et al. <a href="http://www.sciencedirect.com/science/article/pii/S1053811914008027">"Data-driven HRF estimation for encoding and decoding models."</a>, Neuroimage, (2014). Also available <a href="http://arxiv.org/abs/1402.7015">here</a> as an arXiv preprint. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Miyawaki, Yoichi et al., <a href="http://www.sciencedirect.com/science/article/pii/S0896627308009586">"Visual Image Reconstruction from Human Brain Activity using a Combination of Multiscale Local Image Decoders", Neuron (2008)</a>. <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>Kay, Kendrick N., et al. "Identifying natural images from human brain activity." Nature 452.7185 (2008). <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>Eger, Evelyn, et al. <a href="http://www.sciencedirect.com/science/article/pii/S0960982209016236">"Deciphering cortical number coding from human brain activity patterns."</a> Current Biology 19.19 (2009). <a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
</ol>
</div>Plot memory usage as a function of time2014-11-07T00:00:00+01:002014-11-07T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2014-11-07:/blog/2014/plot-memory-usage-as-a-function-of-time/
<p>One of the lesser known features of the <a href="https://pypi.python.org/pypi/memory_profiler">memory_profiler package</a> is its ability to plot memory consumption as a function of time. This was implemented by my friend Philippe Gervais, previously a colleague at INRIA and now at Google.</p>
<p>With …</p>
<p>One of the lesser known features of the <a href="https://pypi.python.org/pypi/memory_profiler">memory_profiler package</a> is its ability to plot memory consumption as a function of time. This was implemented by my friend Philippe Gervais, previously a colleague at INRIA and now at Google.</p>
<p>With this feature it is possible to very easily generate a plot of the memory consumption as a function of time. The result will be something like this:</p>
<p><img src="/blog/images/2014/mprof_example.png" width="800px"/></p>
<p>where you can see the memory used (on the y-axis) as a function of time (on the x-axis). Furthermore, we have used two functions, <code>test1</code> and <code>test2</code>, and the colored brackets make it possible to see when these functions start and finish.</p>
<p>This plot was generated with the following simple script:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">time</span>
<span class="nd">@profile</span>
<span class="k">def</span> <span class="nf">test1</span><span class="p">():</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">10000</span>
<span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">n</span>
<span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">a</span>
<span class="nd">@profile</span>
<span class="k">def</span> <span class="nf">test2</span><span class="p">():</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">100000</span>
<span class="n">b</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">n</span>
<span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">b</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">test1</span><span class="p">()</span>
<span class="n">test2</span><span class="p">()</span>
</code></pre></div>
<p>What happens here is that we have two functions, <code>test1()</code> and <code>test2()</code>, in which we create two lists of different sizes (the one in <code>test2</code> is bigger). We call <code>time.sleep()</code> for one second so that the functions do not return too soon and we have time to get reliable memory measurements.</p>
<p>The decorator <code>@profile</code> is optional and is useful so that <code>memory_profiler</code> knows when the function has been called and can plot the brackets indicating it. If you don't put the decorator, the example will work just fine except that the brackets will not appear in your plot.</p>
<p>Suppose we have saved the script as <code>test1.py</code>. We run the script as</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>mprof<span class="w"> </span>run<span class="w"> </span>test1.py
</code></pre></div>
<p>where mprof is an executable provided by memory_profiler. If the above command was successful it will print something like this</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>mprof<span class="w"> </span>run<span class="w"> </span>test1.py
mprof:<span class="w"> </span>Sampling<span class="w"> </span>memory<span class="w"> </span>every<span class="w"> </span><span class="m">0</span>.1s
running<span class="w"> </span>as<span class="w"> </span>a<span class="w"> </span>Python<span class="w"> </span>program...
</code></pre></div>
<p>The above command will create a <code>.dat</code> file in your current working directory, something like <code>mprofile_20141108113511.dat</code>. This file (you can inspect it, it's a text file) contains the memory measurements for your program.</p>
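<p>If you want full control over the figure, one option is to parse the <code>.dat</code> file yourself and plot the samples however you like. A minimal sketch, assuming the line format I have observed in these files (<code>MEM &lt;MiB&gt; &lt;timestamp&gt;</code>; other record types, such as the command line, are skipped):</p>

```python
import io


def parse_mprof_dat(lines):
    """Extract (timestamps, memory in MiB) from an mprof .dat file.

    Assumes records of the form 'MEM <mem_in_MiB> <unix_timestamp>';
    any other line type is ignored.
    """
    timestamps, mem = [], []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == 'MEM':
            mem.append(float(fields[1]))
            timestamps.append(float(fields[2]))
    return timestamps, mem


# example with a small synthetic file
sample = io.StringIO(
    "CMDLINE python test1.py\n"
    "MEM 10.5 1415443200.0\n"
    "MEM 12.0 1415443200.1\n")
ts, mem = parse_mprof_dat(sample)
print(max(mem))  # 12.0
```

From there you can feed the two lists to matplotlib and style the figure as you wish.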
<p>You can now plot the memory measurements with the command</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>mprof<span class="w"> </span>plot
</code></pre></div>
<p>This will open a matplotlib window and show you the plot:</p>
<p><img src="/blog/images/2014/mprof_example2.png" width="800px"/></p>
<p>As you see, attention has been paid to the default values so that the plot it generates already looks decent without much effort. The not-so-nice part is that, at least as of November 2014, if you want to customize the plot, well, you'll have to look at and modify the mprof script. Some refactoring is still needed in order to make it easier to customize the plots (work in progress).</p>Surrogate Loss Functions in Machine Learning2014-06-20T00:00:00+02:002014-06-20T00:00:00+02:00Fabian Pedregosatag:fa.bianp.net,2014-06-20:/blog/2014/surrogate-loss-functions-in-machine-learning/<!-- <div style="float: left; margin: 20px; width; 200px" >
<img src="http://upload.wikimedia.org/wikipedia/commons/4/46/R._A._Fischer.jpg" />
<p>Sir R. A. Fisher. Source: Wikipedia </p>
</div>
-->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p><span class="bold">TL; DR</span> These are some notes on calibration of surrogate loss functions in the context of machine learning. But mostly it is …</p><!-- <div style="float: left; margin: 20px; width; 200px" >
<img src="http://upload.wikimedia.org/wikipedia/commons/4/46/R._A._Fischer.jpg" />
<p>Sir R. A. Fisher. Source: Wikipedia </p>
</div>
-->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p><span class="bold">TL; DR</span> These are some notes on calibration of surrogate loss functions in the context of machine learning. But mostly it is an excuse to post some images I made.</p>
<p>In the binary classification setting we are given $n$ training samples $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where $X_i$ belongs to some sample space $\mathcal{X}$, usually $\mathbb{R}^p$, but for the purpose of this post we can keep it abstract, and $Y_i \in \{-1, 1\}$ is an integer representing the class label.</p>
<p>We are also given a loss function $\ell: \{-1, 1\} \times \{-1, 1\} \to \mathbb{R}$ that measures the error of a given prediction. The value of the loss function $\ell$ at an arbitrary point $(y, \hat{y})$ is interpreted as the cost incurred by predicting $\hat{y}$ when the true label is $y$. In classification this function is often the zero-one loss, that is, $\ell(y, \hat{y})$ is zero when $y = \hat{y}$ and one otherwise.</p>
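<p>As a tiny concrete example, the empirical zero-one risk is simply the fraction of misclassified samples:</p>

```python
import numpy as np

y_true = np.array([1, -1, 1, 1])   # true labels
y_pred = np.array([1, 1, -1, 1])   # classifier predictions

# zero-one loss averaged over the sample: the fraction of mistakes
empirical_risk = np.mean(y_true != y_pred)
print(empirical_risk)  # 0.5
```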
<p>The goal is to find a function $h: \mathcal{X} \to [k]$, the classifier, with the smallest expected loss on a new sample. In other words, we seek to find a function $h$ that minimizes the expected $\ell$-risk, given by
$$
\mathcal{R}_{\ell}(h) = \mathbb{E}_{X \times Y}[\ell(Y, h(X))]
$$</p>
<p>In theory, we could directly minimize the $\ell$-risk and we would have the optimal classifier, also known as the <em>Bayes predictor</em>. However, there are several problems associated with this approach. One is that the probability distribution of $X \times Y$ is unknown, thus computing the exact expected value is not feasible. It must be approximated by the empirical risk. Another issue is that this quantity is difficult to optimize because the function $\ell$ is discontinuous. Take for example a problem in which $\mathcal{X} = \mathbb{R}^2, k=2$, and we seek to find the linear function $f(X) = \text{sign}(X w)$, $w \in \mathbb{R}^2$, that minimizes the $\ell$-risk. As a function of the parameter $w$, this risk looks something like</p>
<div style="text-align: center">
<img style="margin-top: 0px;" width="350px" src="/blog/images/2014/loss_01.png" alt="loss as function of w"/>
</div>
<p>This function is discontinuous with large, flat regions and is thus extremely hard to optimize using gradient-based methods. For this reason it is usual to consider a proxy to the loss called a <em>surrogate loss function</em>. For computational reasons this is usually a convex function $\Psi: \mathbb{R} \to \mathbb{R}_+$. An example of such surrogate loss functions is the <em>hinge loss</em>, $\Psi(t) = \max(1-t, 0)$, which is the loss used by Support Vector Machines (SVMs). Another example is the logistic loss, $\Psi(t) = \log(1 + \exp(-t))$, used by the logistic regression model. If we consider the logistic loss, minimizing the $\Psi$-risk, given by $\mathbb{E}_{X \times Y}[\Psi(Y f(X))]$, of the function $f(X) = X w$ becomes a much more tractable optimization problem:</p>
<div style="text-align: center">
<img style="margin-top: 0px;" width="350px" src="/blog/images/2014/loss_log.png" />
</div>
<p>In short, we have replaced the $\ell$-risk, which is computationally difficult to optimize, with the $\Psi$-risk, which has more advantageous properties. A natural question to ask is how much we have lost by this change. The property of whether minimizing the $\Psi$-risk leads to a function that also minimizes the $\ell$-risk is often referred to as <em>consistency</em> or <em>calibration</em>. For a more formal definition see [<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>] and [<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup>]. This property depends on the surrogate function $\Psi$: some functions $\Psi$ satisfy the consistency property and some do not. One of the most useful characterizations was given in [<sup id="fnref2:1"><a class="footnote-ref" href="#fn:1">1</a></sup>] and states that if $\Psi$ is convex then it is consistent if and only if it is differentiable at zero and $\Psi'(0) < 0$. This includes most of the commonly used surrogate loss functions, including the hinge, logistic, and Huber loss functions.</p>
<div style="text-align: center">
<img style="margin-top: 0px;" width="550px" src="/blog/images/2014/loss_functions.png" />
</div>
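<p>The characterization above is easy to verify numerically. A quick sketch, checking the sign of $\Psi'(0)$ with a central finite difference (the loss definitions and names here are mine):</p>

```python
import numpy as np

# two of the surrogate losses discussed above
hinge = lambda t: np.maximum(1.0 - t, 0.0)
logistic = lambda t: np.log1p(np.exp(-t))


def derivative_at_zero(psi, eps=1e-6):
    # central finite-difference approximation of psi'(0)
    return (psi(eps) - psi(-eps)) / (2.0 * eps)


# both are differentiable at zero with a negative derivative,
# hence consistent by the characterization above
for name, psi in [("hinge", hinge), ("logistic", logistic)]:
    print(name, derivative_at_zero(psi) < 0)
```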
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity , Classification , and Risk Bounds,” J. Am. Stat. Assoc., pp. 1–36, 2003. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a><a class="footnote-backref" href="#fnref2:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>A. Tewari and P. L. Bartlett, “On the Consistency of Multiclass Classification Methods,” J. Mach. Learn. Res., vol. 8, pp. 1007–1025, 2007. <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>Different ways to get memory consumption or lessons learned from ``memory_profiler``2013-07-25T00:00:00+02:002013-07-25T00:00:00+02:00Fabian Pedregosatag:fa.bianp.net,2013-07-25:/blog/2013/different-ways-to-get-memory-consumption-or-lessons-learned-from-memory_profiler/<p>As part of the development of
<a href="https://pypi.python.org/pypi/memory_profiler">memory_profiler</a> I've tried
several ways to get memory usage of a program from within Python. In this post
I'll describe the different alternatives I've tested.</p>
<h3>The psutil library</h3>
<p><a href="https://code.google.com/p/psutil/">psutil</a> is a python library that provides
an interface for retrieving information on running processes. It …</p><p>As part of the development of
<a href="https://pypi.python.org/pypi/memory_profiler">memory_profiler</a> I've tried
several ways to get memory usage of a program from within Python. In this post
I'll describe the different alternatives I've tested.</p>
<h3>The psutil library</h3>
<p><a href="https://code.google.com/p/psutil/">psutil</a> is a python library that provides
an interface for retrieving information on running processes. It provides
convenient, fast and cross-platform functions to access the memory usage of a
Python module:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">memory_usage_psutil</span><span class="p">():</span>
<span class="c1"># return the memory usage in MB</span>
<span class="kn">import</span> <span class="nn">psutil</span>
<span class="n">process</span> <span class="o">=</span> <span class="n">psutil</span><span class="o">.</span><span class="n">Process</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">getpid</span><span class="p">())</span>
<span class="n">mem</span> <span class="o">=</span> <span class="n">process</span><span class="o">.</span><span class="n">get_memory_info</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="mi">20</span><span class="p">)</span>
<span class="k">return</span> <span class="n">mem</span>
</code></pre></div>
<p>The above function returns the memory usage of the current Python process in
MiB. Depending on the platform, it will choose the most accurate and fastest
way to get this information. For example, on Windows it will use the C++ Win32
API while on Linux it will read from <code>/proc</code>, hiding the implementation
details and providing on each platform a fast and accurate measurement.</p>
<p>If you are looking for an easy way to get the memory consumption within Python,
this is, in my opinion, your best shot.</p>
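<p>Note that psutil's API has changed since this post was written: recent versions expose <code>memory_info()</code> instead of <code>get_memory_info()</code>. An updated version of the function above, guarded so it degrades gracefully when psutil is not installed, might look like:</p>

```python
import os


def memory_usage_psutil():
    # return the resident memory (RSS) of the current process in MiB,
    # or None if psutil is not available
    try:
        import psutil
    except ImportError:
        return None
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / float(2 ** 20)


print(memory_usage_psutil())
```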
<h3>The resource module</h3>
<p>The <a href="http://docs.python.org/2/library/resource.html">resource module</a> is part
of the standard Python library. It's basically a wrapper around
<code>&lt;sys/resource.h&gt;</code> and
<a href="http://pubs.opengroup.org/onlinepubs/007904975/functions/getrusage.html">getrusage</a>,
which is a POSIX standard, although some methods are <a href="http://linux.die.net/man/2/getrusage">still missing in
Linux</a>. However, the ones we are
interested in seem to work fine in Ubuntu 10.04. You can get the memory usage
with this function:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">memory_usage_resource</span><span class="p">():</span>
<span class="kn">import</span> <span class="nn">resource</span>
<span class="n">rusage_denom</span> <span class="o">=</span> <span class="mf">1024.</span>
<span class="k">if</span> <span class="n">sys</span><span class="o">.</span><span class="n">platform</span> <span class="o">==</span> <span class="s1">'darwin'</span><span class="p">:</span>
<span class="c1"># ... it seems that in OSX the output is in different units ...</span>
<span class="n">rusage_denom</span> <span class="o">=</span> <span class="n">rusage_denom</span> <span class="o">*</span> <span class="n">rusage_denom</span>
<span class="n">mem</span> <span class="o">=</span> <span class="n">resource</span><span class="o">.</span><span class="n">getrusage</span><span class="p">(</span><span class="n">resource</span><span class="o">.</span><span class="n">RUSAGE_SELF</span><span class="p">)</span><span class="o">.</span><span class="n">ru_maxrss</span> <span class="o">/</span> <span class="n">rusage_denom</span>
<span class="k">return</span> <span class="n">mem</span>
</code></pre></div>
<p>In my experience this approach is several times faster than the one based on
psutil, and it was the default way to get the memory usage in
<code>memory_profiler</code> from version 0.23 up to 0.26. I changed this behavior in
0.27 after a bug report by <a href="https://github.com/pgervais">Philippe Gervais</a>.
The problem with this approach is that it seems to report results that are
slightly different in some cases. Notably, it seems to differ when objects
have recently been freed by the Python interpreter.</p>
<p>In the following example, orphaned arrays are freed by the Python
interpreter, which is correctly seen by <code>psutil</code> but not by <code>resource</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">mem_resource</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">mem_psutil</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">21</span><span class="p">):</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">1000</span> <span class="o">*</span> <span class="n">i</span><span class="p">,</span> <span class="mi">100</span> <span class="o">*</span> <span class="n">i</span><span class="p">))</span>
<span class="n">mem_resource</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">memory_usage_resource</span><span class="p">())</span>
<span class="n">mem_psutil</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">memory_usage_psutil</span><span class="p">())</span>
</code></pre></div>
<p><img alt="Memory plot" src="/blog/static/code/2013/resource_vs_psutil.png"></p>
<p>By the way, I would be delighted to be corrected if I'm doing something wrong,
or informed of a workaround if one exists (I've got the code to reproduce the
figures <sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>).</p>
<h3>Querying <code>ps</code> directly</h3>
<p>The method based on <code>psutil</code> works great but is not available by default on all
Python systems. Because of this, in <code>memory_profiler</code> we use as a last resort
something that's pretty ugly but works reasonably well when all else fails:
invoking the system's <code>ps</code> command and parsing the output. The code is
something like:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">memory_usage_ps</span><span class="p">():</span>
<span class="kn">import</span> <span class="nn">subprocess</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">subprocess</span><span class="o">.</span><span class="n">Popen</span><span class="p">([</span><span class="s1">'ps'</span><span class="p">,</span> <span class="s1">'v'</span><span class="p">,</span> <span class="s1">'-p'</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">getpid</span><span class="p">())],</span>
<span class="n">stdout</span><span class="o">=</span><span class="n">subprocess</span><span class="o">.</span><span class="n">PIPE</span><span class="p">)</span><span class="o">.</span><span class="n">communicate</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">b</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="n">vsz_index</span> <span class="o">=</span> <span class="n">out</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">()</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="sa">b</span><span class="s1">'RSS'</span><span class="p">)</span>
<span class="n">mem</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">out</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">()[</span><span class="n">vsz_index</span><span class="p">])</span> <span class="o">/</span> <span class="mi">1024</span>
<span class="k">return</span> <span class="n">mem</span>
</code></pre></div>
<p>The main disadvantage of this approach is that it needs to fork a process for
each measurement. For tasks where you need to get the memory usage very fast,
like line-by-line profiling, this can be a huge overhead on the code.
For other tasks, such as getting information on long-running processes, where
the measurement anyway runs in a separate process, this is not too bad.</p>
<h3>Benchmarks</h3>
<p>Here is a benchmark of the different alternatives presented above. I am
plotting the time it takes each approach to make 100 measurements
of the memory usage (lower is better). As can be seen, the fastest one is
<code>resource</code> (although it suffers from the issues described above), followed
closely by <code>psutil</code>, which is in my opinion the best option if you can count on
it being installed on the host system, and trailed at a distance by <code>ps</code>, which is
roughly a hundred times slower than <code>psutil</code>.</p>
<p><img alt="Memory plot" src="/blog/static/code/2013/time_100_measurements.png"></p>
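<p>A rough version of this benchmark can be reproduced with the standard <code>timeit</code> module. A sketch for the <code>resource</code>-based variant (Unix-only, since <code>resource</code> is not available on Windows):</p>

```python
import sys
import timeit
import resource


def memory_usage_resource():
    # ru_maxrss is reported in KiB on Linux and in bytes on OS X
    denom = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / denom


# time 100 measurements, as in the benchmark above
elapsed = timeit.timeit(memory_usage_resource, number=100)
print("100 measurements took %.4f seconds" % elapsed)
```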
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>IPython notebook to reproduce the figures: <a href="http://nbviewer.ipython.org/url/fa.bianp.net/blog/static/code/2013/memory_usage.ipynb">html</a> <a href="http://fa.bianp.net/blog/static/code/2013/memory_usage.ipynb">ipynb</a> <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Numerical optimizers for Logistic Regression2013-05-20T00:00:00+02:002013-05-20T00:00:00+02:00Fabian Pedregosatag:fa.bianp.net,2013-05-20:/blog/2013/numerical-optimizers-for-logistic-regression/<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>In this post I compare several implementations of
Logistic Regression. The task was to implement a Logistic Regression model
using standard optimization …</p><script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>In this post I compare several implementations of
Logistic Regression. The task was to implement a Logistic Regression model
using standard optimization tools from <code>scipy.optimize</code> and compare
them against state of the art implementations such as
<a href="http://www.csie.ntu.edu.tw/~cjlin/liblinear/">LIBLINEAR</a>.</p>
<p>In this blog post I'll write down all the implementation details of this
model, in the hope that not only the conclusions but also the process would be
useful for future comparisons and benchmarks.</p>
<h2>Function evaluation</h2>
<p>We consider the case in which the decision function is an affine function, i.e., $f(x) = \langle x, w \rangle + c$, where $w$ and $c$ are parameters to estimate. The loss function for the $\ell_2$-regularized logistic regression, i.e., the
function to be minimized is</p>
<p>$$
\mathcal{L}(w, \lambda, X, y) = - \frac{1}{n}\sum_{i=1}^n \log(\phi(y_i (\langle X_i, w \rangle + c))) + \frac{\lambda}{2} w^T w
$$</p>
<p>where $\phi(t) = 1 / (1 + \exp(-t))$ is the <a href="http://en.wikipedia.org/wiki/Logistic_function">logistic
function</a>, $\frac{\lambda}{2} w^T w$ is
the regularization term and $X, y$ is the input data, with $X \in
\mathbb{R}^{n \times p}$ and $y \in \{-1, 1\}^n$. However, this formulation is
not great from a practical standpoint. Even for plausible values of $t$
such as $t = -100$, $\exp(-t) = \exp(100)$ will overflow, assigning the loss an
(erroneous) value of $+\infty$. For this reason <sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>, we evaluate
$\log(\phi(t))$ as</p>
<p>$$
\log(\phi(t)) =
\begin{cases}
- \log(1 + \exp(-t)) \text{ if } t > 0 \\
t - \log(1 + \exp(t)) \text{ if } t \leq 0\\
\end{cases}
$$</p>
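<p>In code, this case distinction is only a couple of lines. A sketch of how it might look (the function name is mine):</p>

```python
import numpy as np


def log_phi(t):
    # numerically stable evaluation of log(phi(t)) = -log(1 + exp(-t)),
    # following the case distinction above: exp is only ever called on
    # non-positive arguments, so it cannot overflow
    t = np.asarray(t, dtype=float)
    out = np.empty_like(t)
    idx = t > 0
    out[idx] = -np.log1p(np.exp(-t[idx]))
    out[~idx] = t[~idx] - np.log1p(np.exp(t[~idx]))
    return out


# finite results even for extreme arguments
print(log_phi([0.0, -1000.0, 1000.0]))
```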
<p>The gradient of the loss function is given by</p>
<p>$$\begin{aligned}
\nabla_w \mathcal{L} &= \frac{1}{n}\sum_{i=1}^n y_i X_i (\phi(y_i (\langle X_i, w \rangle + c)) - 1) + \lambda w \\
\nabla_c \mathcal{L} &= \frac{1}{n}\sum_{i=1}^n y_i (\phi(y_i (\langle X_i, w \rangle + c)) - 1)
\end{aligned}$$</p>
<p>Similarly, the logistic function $\phi$ used here can be computed in a more
stable way using the formula</p>
<p>$$
\phi(t) = \begin{cases}
1 / (1 + \exp(-t)) \text{ if } t > 0 \\
\exp(t) / (1 + \exp(t)) \text{ if } t \leq 0\\
\end{cases}
$$</p>
<p>Finally, we will also need the Hessian for some second-order methods, which is given by</p>
<p>$$
\nabla_w ^2 \mathcal{L} = \frac{1}{n} X^T D X + \lambda I
$$</p>
<p>where $I$ is the identity matrix and $D$ is a diagonal matrix given by $D_{ii} = \phi(y_i w^T X_i)(1 - \phi(y_i w^T X_i))$.</p>
<p>In Python, these functions can be written as</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">phi</span><span class="p">(</span><span class="n">t</span><span class="p">):</span>
<span class="c1"># logistic function, returns 1 / (1 + exp(-t))</span>
<span class="n">idx</span> <span class="o">=</span> <span class="n">t</span> <span class="o">></span> <span class="mi">0</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">empty</span><span class="p">(</span><span class="n">t</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float</span><span class="p">)</span>
<span class="n">out</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">=</span> <span class="mf">1.</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">t</span><span class="p">[</span><span class="n">idx</span><span class="p">]))</span>
<span class="n">exp_t</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">])</span>
<span class="n">out</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">]</span> <span class="o">=</span> <span class="n">exp_t</span> <span class="o">/</span> <span class="p">(</span><span class="mf">1.</span> <span class="o">+</span> <span class="n">exp_t</span><span class="p">)</span>
<span class="k">return</span> <span class="n">out</span>
<span class="k">def</span> <span class="nf">loss</span><span class="p">(</span><span class="n">x0</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">alpha</span><span class="p">):</span>
<span class="c1"># logistic loss function, returns Sum{-log(phi(t))}</span>
<span class="n">w</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">x0</span><span class="p">[:</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span> <span class="n">x0</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">w</span><span class="p">)</span> <span class="o">+</span> <span class="n">c</span>
<span class="n">yz</span> <span class="o">=</span> <span class="n">y</span> <span class="o">*</span> <span class="n">z</span>
<span class="n">idx</span> <span class="o">=</span> <span class="n">yz</span> <span class="o">></span> <span class="mi">0</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">yz</span><span class="p">)</span>
<span class="n">out</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">yz</span><span class="p">[</span><span class="n">idx</span><span class="p">]))</span>
<span class="n">out</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="n">yz</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">]</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">yz</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">])))</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">out</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">/</span> <span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="mf">.5</span> <span class="o">*</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">w</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">w</span><span class="p">)</span>
<span class="k">return</span> <span class="n">out</span>
<span class="k">def</span> <span class="nf">gradient</span><span class="p">(</span><span class="n">x0</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">alpha</span><span class="p">):</span>
<span class="c1"># gradient of the logistic loss</span>
<span class="n">w</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">x0</span><span class="p">[:</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span> <span class="n">x0</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">w</span><span class="p">)</span> <span class="o">+</span> <span class="n">c</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">phi</span><span class="p">(</span><span class="n">y</span> <span class="o">*</span> <span class="n">z</span><span class="p">)</span>
<span class="n">z0</span> <span class="o">=</span> <span class="p">(</span><span class="n">z</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">y</span>
<span class="n">grad_w</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">T</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">z0</span><span class="p">)</span> <span class="o">/</span> <span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">w</span>
<span class="n">grad_c</span> <span class="o">=</span> <span class="n">z0</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">/</span> <span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">((</span><span class="n">grad_w</span><span class="p">,</span> <span class="p">[</span><span class="n">grad_c</span><span class="p">]))</span>
</code></pre></div>
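<p>The Hessian can be implemented analogously. Here is a minimal sketch following the same conventions as the functions above (the helper name <code>hessian</code> is mine, and I include the $1/n$ normalization of the loss):</p>

```python
import numpy as np

def phi(t):
    # numerically stable logistic function (same as above)
    t = np.asarray(t, dtype=np.float64)
    out = np.empty_like(t)
    idx = t > 0
    out[idx] = 1. / (1 + np.exp(-t[idx]))
    exp_t = np.exp(t[~idx])
    out[~idx] = exp_t / (1. + exp_t)
    return out

def hessian(x0, X, y, alpha):
    # Hessian of the logistic loss with respect to w:
    # (1/n) X^T D X + alpha * I, where D is diagonal with
    # D_ii = phi(y_i z_i) * (1 - phi(y_i z_i))
    w, c = x0[:X.shape[1]], x0[-1]
    z = phi(y * (X.dot(w) + c))
    d = z * (1 - z)
    return X.T.dot(d[:, np.newaxis] * X) / X.shape[0] + alpha * np.eye(X.shape[1])
```

<p>The result can then be passed, for instance, as the <code>fhess</code> argument of Newton-type solvers such as <code>scipy.optimize.fmin_ncg</code>.</p>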
<h2>Benchmarks</h2>
<p>I tried several methods to estimate this $\ell_2$-regularized logistic regression. One of them is a
first-order method (that is, it only makes use of the gradient and not of
the Hessian), <a href="http://en.wikipedia.org/wiki/Nonlinear_conjugate_gradient_method">Conjugate
Gradient</a>,
whereas all the others are <a href="http://en.wikipedia.org/wiki/Quasi-Newton_methods">Quasi-Newton methods</a>. The methods I tested are:</p>
<ul>
<li><strong>CG</strong> = Conjugate Gradient as implemented in <code>scipy.optimize.fmin_cg</code></li>
<li><strong>TNC</strong> = Truncated Newton as implemented in <code>scipy.optimize.fmin_tnc</code></li>
<li><strong>BFGS</strong> = Broyden–Fletcher–Goldfarb–Shanno method, as implemented in <code>scipy.optimize.fmin_bfgs</code>.</li>
<li><strong>L-BFGS</strong> = Limited-memory <a href="http://en.wikipedia.org/wiki/BFGS_method">BFGS</a> as implemented in <code>scipy.optimize.fmin_l_bfgs_b</code>. Contrary to the BFGS algorithm, which is written in Python, this one wraps a C implementation.</li>
<li><strong>Trust Region</strong> = Trust Region Newton method <sup id="fnref2:1"><a class="footnote-ref" href="#fn:1">1</a></sup>. This is the solver used by <a href="http://www.csie.ntu.edu.tw/~cjlin/liblinear/">LIBLINEAR</a> that I've wrapped to accept any Python function in the package <a href="http://github.com/fabianp/pytron/">pytron</a></li>
</ul>
<p>To ensure the most accurate results across implementations, all timings were
collected by callback functions that were called from the algorithm on each
iteration. Finally, I plot the maximum absolute value of the gradient (=the
infinity norm of the gradient) with respect to time.</p>
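<p>The callback mechanism can be sketched as follows, using a simple quadratic in place of the logistic loss (illustrative only; the actual benchmark used the loss and gradient defined earlier):</p>

```python
import time
import numpy as np
from scipy import optimize

def f(x):
    # simple quadratic stand-in for the logistic loss
    return np.sum((x - 1.0) ** 2)

def grad(x):
    return 2.0 * (x - 1.0)

timings, grad_norms = [], []
start = time.time()

def callback(xk):
    # called by the solver once per iteration
    timings.append(time.time() - start)
    grad_norms.append(np.max(np.abs(grad(xk))))

x_opt, f_opt, info = optimize.fmin_l_bfgs_b(
    f, np.zeros(5), fprime=grad, callback=callback)
```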
<p>The synthetic data used in the benchmarks was generated as described in <sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup>:
the design matrix $X$ has Gaussian entries, the vector of
coefficients is also drawn from a Gaussian distribution, and the explained
variable $y$ is generated as $y = \text{sign}(X w)$. We then perturb the matrix
$X$ by adding Gaussian noise with covariance 0.8. The number of samples and features
was fixed to $10^4$ and $10^3$ respectively, and the penalization parameter $\lambda$ was
fixed to 1.</p>
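<p>The data generation can be sketched as follows (much smaller sizes than in the benchmark, and the exact noise parameterization is my assumption):</p>

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 100, 10

# design matrix and true coefficients with Gaussian entries
X = rng.randn(n_samples, n_features)
w_true = rng.randn(n_features)

# targets in {-1, 1}
y = np.sign(X.dot(w_true))

# perturb the design matrix with Gaussian noise
X += 0.8 * rng.randn(n_samples, n_features)
```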
<p>In this setting variables are typically uncorrelated and most solvers perform
decently:</p>
<p><img alt="Benchmark Logistic" src="/blog/static/images/2013/comparison_logistic_corr_0.png"></p>
<p>Here, the Trust Region and L-BFGS solvers perform almost equally well, with
Conjugate Gradient and Truncated Newton falling shortly behind. I was surprised
by the difference between BFGS and L-BFGS: I would have thought that when memory is not an issue both algorithms should perform similarly.</p>
<p>To make things more interesting, we now make the design slightly more
correlated. We do so by adding a constant term of 1 to the matrix $X$, and
we also append a column vector of ones to this matrix to account for the intercept. These are the results:</p>
<p><img alt="Benchmark Logistic" src="/blog/static/images/2013/comparison_logistic_corr_1.png"></p>
<p>Here, we already see that second-order methods dominate over first-order
methods (well, except for BFGS), with Trust Region clearly dominating the
picture but with TNC not far behind.</p>
<p>Finally, if we force the matrix to be even more correlated (we add 10. to the
design matrix $X$), then we have:</p>
<p><img alt="Benchmark Logistic" src="/blog/static/images/2013/comparison_logistic_corr_10.png"></p>
<p>Here, the Trust Region method has the same timing as before, but all other
methods have become substantially worse. The Trust Region
method, unlike the others, is surprisingly robust to correlated designs.</p>
<p>To sum up, the Trust Region method performs extremely well for optimizing the
Logistic Regression model under different conditionings of the design matrix.
The <a href="http://www.csie.ntu.edu.tw/~cjlin/liblinear/">LIBLINEAR</a> software uses
this solver and thus has similar performance, with the sole exception that the
evaluation of the logistic function and its derivatives is done in C++ instead
of Python. In practice, however, due to the small number of iterations of this
solver I haven't seen any significant difference.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>A similar development can be found in the source code of <a href="http://www.csie.ntu.edu.tw/~cjlin/liblinear/">LIBLINEAR</a>, and is probably also used elsewhere. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a><a class="footnote-backref" href="#fnref2:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>"A comparison of numerical optimizers for logistic regression", P. Minka, <a href="http://research.microsoft.com/en-us/um/people/minka/papers/logreg/">URL</a> <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>"Newton's Method for Large Bound-Constrained Optimization Problems", Chih-Jen Lin, Jorge J. More <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.7340">URL</a> <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p><a href="http://nbviewer.ipython.org/urls/raw.github.com/fabianp/pytron/master/doc/benchmark_logistic.ipynb">IPython Notebook to reproduce the benchmarks</a> <a href="https://github.com/fabianp/pytron/blob/master/doc/benchmark_logistic.ipynb">source</a> <a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
</ol>
</div>Logistic Ordinal Regression2013-05-02T00:00:00+02:002013-05-02T00:00:00+02:00Fabian Pedregosatag:fa.bianp.net,2013-05-02:/blog/2013/logistic-ordinal-regression/<p><strong>TL;DR: I've implemented a logistic ordinal regression or
proportional odds model. <a href="http://github.com/fabianp/minirank/blob/master/minirank/logistic.py">Here is the Python code</a></strong></p>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>The <em>logistic ordinal regression</em> model …</p><p><strong>TL;DR: I've implemented a logistic ordinal regression or
proportional odds model. <a href="http://github.com/fabianp/minirank/blob/master/minirank/logistic.py">Here is the Python code</a></strong></p>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>The <em>logistic ordinal regression</em> model, also known as the
proportional odds model, was introduced in the early 80s by McCullagh [<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>, <sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup>]
and is a generalized linear model specially tailored for the case of
predicting ordinal variables, that is, variables that are discrete (as
in classification) but which can be ordered (as in regression). It can
be seen as an extension of the logistic regression model to the
ordinal setting.</p>
<p>We are given input data $X \in \mathbb{R}^{n \times p}$ and target values $y \in
\mathbb{N}^n$. For simplicity we assume $y$ is a
non-decreasing vector, that is, $y_1 \leq y_2 \leq ...$. Just as
logistic regression models the posterior probability $P(y=j|X_i)$ with the
logistic function, in logistic ordinal regression we model the
<em>cumulative</em> probability with the logistic function. That is,</p>
<p>$$
P(y \leq j|X_i) = \phi(\theta_j - w^T X_i) = \frac{1}{1 + \exp(w^T X_i - \theta_j)}
$$</p>
<p>where $w, \theta$ are vectors to be estimated from the data and $\phi$
is the logistic function defined as $\phi(t) = 1 / (1 + \exp(-t))$.</p>
<figure style="float: left; width: 380px; margin: 0px 15px 15px 0px">
<img src="/blog/static/images/2013/ordinal_1.png"/>
<img src="/blog/static/images/2013/ordinal_logistic.png"/>
<figcaption style="margin-left: 10px; margin-right: 5px">Toy example with three classes denoted in different colors. Also shown the vector of coefficients $w$ and the thresholds $\theta_0$ and $\theta_1$</figcaption>
</figure>
<p>Compared to multiclass logistic regression, we have added the
constraint that the hyperplanes separating the different classes are
<em>parallel</em>, that is, the vector $w$ is common across
classes. To decide which class $X_i$ will be assigned to, we make use
of the vector of thresholds $\theta$. If there are $K$ different
classes, $\theta$ is a non-decreasing vector (that is, $\theta_1 \leq
\theta_2 \leq ... \leq \theta_{K-1}$) of size $K-1$. We will then
assign the class $j$ if the prediction $w^T X$ (recall that it's a
linear model) lies in the interval $[\theta_{j-1}, \theta_{j}[$. In
order to keep the same definition for extremal classes, we define
$\theta_{0} = - \infty$ and $\theta_K = + \infty$.</p>
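<p>The decision rule above can be written in a few lines (the helper name is hypothetical, and class labels here are 0-indexed):</p>

```python
import numpy as np

def predict_ordinal(w, theta, X):
    # assign row X_i to class j (0-indexed) such that w^T X_i
    # lies in [theta_{j-1}, theta_j), with theta_0 = -inf and
    # theta_K = +inf implicit
    scores = X.dot(w)
    # searchsorted counts how many thresholds are <= each score,
    # which is exactly the 0-indexed class
    return np.searchsorted(theta, scores, side='right')
```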
<p>The intuition is that we are seeking a vector $w$ such that $X w$
produces a set of values that are well separated into the different
classes by the different thresholds $\theta$. We choose a logistic
function to model the probability $P(y \leq j|X_i)$ but other choices
are possible. In the proportional hazards model <sup id="fnref2:1"><a class="footnote-ref" href="#fn:1">1</a></sup> the probability
is modeled as $-\log(1 - P(y \leq j | X_i)) = \exp(\theta_j - w^T
X_i)$. Other link functions are possible, where the link function
satisfies $\text{link}(P(y \leq j | X_i)) = \theta_j - w^T X_i$. Under
this framework, the logistic ordinal regression model has a logistic
link function and the proportional hazards model has a log-log link
function.</p>
<p>The logistic ordinal regression model is also known as the
proportional odds model, because the
<a href="http://en.wikipedia.org/wiki/Odds_ratio">ratio of corresponding odds</a>
for two different samples $X_1$ and $X_2$ is $\exp(w^T(X_1 - X_2))$ and
so does not depend on the class $j$ but only on the difference between
the samples $X_1$ and $X_2$.</p>
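<p>To see this, note that since $P(y \leq j|X_i) = \phi(\theta_j - w^T X_i)$, the corresponding odds are</p>
<p>$$
\frac{P(y \leq j|X_i)}{1 - P(y \leq j|X_i)} = \exp(\theta_j - w^T X_i)
$$</p>
<p>so in the ratio of the odds for two samples $X_1$ and $X_2$ the term $\theta_j$ cancels, leaving the factor $\exp(w^T(X_1 - X_2))$, which is independent of $j$.</p>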
<h3>Optimization</h3>
<p>Model estimation can be posed as an optimization problem. Here, we
minimize the loss function for the model, defined as minus the
log-likelihood:</p>
<p>$$
\mathcal{L}(w, \theta) = - \sum_{i=1}^n \log(\phi(\theta_{y_i} - w^T X_i) - \phi(\theta_{y_i -1} - w^T X_i))
$$</p>
<p>In this sum all terms are convex on $w$, thus the loss function is
convex over $w$. It might be also jointly convex over $w$ and
$\theta$, although I haven't checked. I use the function
<code>fmin_slsqp</code> in <code>scipy.optimize</code> to optimize
$\mathcal{L}$ under the constraint that $\theta$ is a non-decreasing
vector. There might be better options that I'm not aware of. If you know of one,
please leave a comment!</p>
<p>Using the formula $(\log \phi(t))^\prime = 1 - \phi(t)$, we can compute the gradient of the loss function as</p>
<p>$\begin{align}
\nabla_w \mathcal{L}(w, \theta) &= \sum_{i=1}^n X_i (1 - \phi(\theta_{y_i} - w^T X_i) - \phi(\theta_{y_i-1} - w^T X_i)) \\
% \nabla_\theta \mathcal{L}(w, \theta) &= \sum_{i=1}^n - \frac{e_{y_i} \exp(\theta_{y_i}) - e_{y_i -1} \exp(\theta_{y_i -1})}{\exp(\theta_{y_i}) - \exp(\theta_{y_i-1})} \\
\nabla_\theta \mathcal{L}(w, \theta) &= \sum_{i=1}^n e_{y_i} \left(1 - \phi(\theta_{y_i} - w^T X_i) - \frac{1}{1 - \exp(\theta_{y_i -1} - \theta_{y_i})}\right) \\
& \qquad + e_{y_i -1}\left(1 - \phi(\theta_{y_i -1} - w^T X_i) - \frac{1}{1 - \exp(- (\theta_{y_i-1} - \theta_{y_i}))}\right)
\end{align}$</p>
<p>where $e_i$ is the $i$th canonical vector.</p>
<h3>Code</h3>
<p>I've implemented a Python version of this algorithm using Scipy's
<code>optimize.fmin_slsqp</code> function. This takes as arguments the
loss function, the gradient derived above and a function that is
> 0 when the inequalities on $\theta$ are satisfied.</p>
<p><a
href="http://github.com/fabianp/minirank/blob/master/minirank/logistic.py">Code
can be found here</a> as part of the <a
href="https://github.com/fabianp/minirank">minirank</a> package, which
is my sandbox for code related to ranking and ordinal regression. At
some point I would like to submit it to scikit-learn, but right now I
don't know how the code will scale to medium-sized problems, and I
suspect not great. On top of that, I'm not sure there is real demand
for these models in scikit-learn, and I don't want to bloat the package
with unused features.</p>
<h3>Performance</h3>
<p>I compared the prediction accuracy of this model in the sense of mean absolute
error <a
href="http://nbviewer.ipython.org/urls/raw.github.com/fabianp/minirank/master/notebooks/comparison_ordinal_logistic.ipynb">(IPython
notebook)</a> on the <a
href="http://scikit-learn.org/dev/modules/generated/sklearn.datasets.load_boston.html">boston
house-prices dataset</a>. To have an ordinal variable, I
rounded the values to the closest integer, which gave me a problem of
size 506 $\times$ 13 with 46 different target values. Although not a
huge increase in accuracy, this model did give me better results on
this particular dataset:</p>
<figure>
<img src="/blog/static/images/2013/bars_ordinal.png"/>
</figure>
<p>Here, ordinal logistic regression is the best-performing model,
followed by a Linear Regression model and a One-versus-All Logistic
regression model as implemented in scikit-learn.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>"Regression models for ordinal data", P. McCullagh, Journal of
the royal statistical society. Series B (Methodological), 1980 <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a><a class="footnote-backref" href="#fnref2:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>"Generalized Linear Models", P. McCullagh and J. A. Nelder (Book) <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>"Loss Functions for Preference Levels : Regression with Discrete
Ordered Labels", Jason D. M. Rennie, Nathan Srebro <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
</ol>
</div>Isotonic Regression2013-04-16T00:00:00+02:002013-04-16T00:00:00+02:00Fabian Pedregosatag:fa.bianp.net,2013-04-16:/blog/2013/isotonic-regression/<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>My latest contribution for <a href="http://scikit-learn.org">scikit-learn</a> is
an implementation of the isotonic regression model that I coded with
<a href="https://twitter.com/nvaroqua">Nelle Varoquaux</a> and
<a href="http://alexandre.gramfort.net/">Alexandre Gramfort …</a></p><script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>My latest contribution for <a href="http://scikit-learn.org">scikit-learn</a> is
an implementation of the isotonic regression model that I coded with
<a href="https://twitter.com/nvaroqua">Nelle Varoquaux</a> and
<a href="http://alexandre.gramfort.net/">Alexandre Gramfort</a>. This model
finds the best least squares fit to a set of points, given the
constraint that the fit must be a non-decreasing
function. <a href="http://scikit-learn.sourceforge.net/dev/auto_examples/plot_isotonic_regression.html">The example</a>
on the scikit-learn website gives an intuition on this model.</p>
<p><img alt="isotonic regression" src="/blog/static/images/2013/plot_isotonic_regression_1.png"></p>
<p>The original points are in red, and the estimated ones are in
green. As you can see, there is one estimation (green point) for each
data sample (red point). Calling $y \in \mathbb{R}^n$ the input data,
the model can be written concisely as an optimization problem over $x$</p>
<p>$$
\text{argmin}_x \|y - x\|^2 \\
\text{subject to } x_1 \leq x_2 \leq \cdots \leq x_n
$$</p>
<p>The algorithm implemented in scikit-learn <sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup> is the pool adjacent
violators algorithm <sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>, which is an efficient linear time
$\mathcal{O}(n)$ algorithm. The algorithm sweeps through the data
looking for violations of the monotonicity constraint. When it finds
one, it adjusts the estimate to the best possible fit with
constraints. Sometimes it also needs to modify previous points to make
sure the new estimate does not violate the constraints. The following
picture shows how it proceeds at each iteration</p>
<p><img alt="isotonic regression" src="/blog/static/images/2013/isotonic.gif"></p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>"Active set algorithms for isotonic regression; A unifying
framework", Michael J. Best, Nilotpal Chakravarti <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Python notebook to generate the figures: <a href="/blog/static/uploads/2013/isotonic_regression_sklearn.ipynb">ipynb</a> and <a href="http://nbviewer.ipython.org/url/fa.bianp.net/blog/static/uploads/2013/isotonic_regression_sklearn.ipynb">web version</a> <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>The algorithm is used through the sklearn.isotonic.IsotonicRegression object (<a href="http://scikit-learn.org/dev/modules/generated/sklearn.isotonic.IsotonicRegression.html">doc</a>) or the function sklearn.isotonic.isotonic_regression (<a href="http://scikit-learn.org/dev/modules/generated/sklearn.isotonic.isotonic_regression.html">doc</a>) <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
</ol>
</div>Householder matrices2013-03-30T00:00:00+01:002013-03-30T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2013-03-30:/blog/2013/householder-matrices/<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>Householder matrices are square matrices of the form</p>
<p>$$ P = I - \beta v v^T$$</p>
<p>where $\beta$ is a scalar and $v$ is …</p><script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>Householder matrices are square matrices of the form</p>
<p>$$ P = I - \beta v v^T$$</p>
<p>where $\beta$ is a scalar and $v$ is a vector. It has the useful
property that for suitably chosen $v$ and $\beta$, the product
$P x$ zeroes out all of the coordinates but the first, that is, $P x =
\|x\| e_1$. The following code, given $x$, finds the values of $\beta,
v$ that verify that property. The algorithm can be found in several
textbooks <sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup></p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">house</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""</span>
<span class="sd"> Given a vetor x, computes vectors v with v[0] = 1</span>
<span class="sd"> and scalar beta such that P = I - beta v v^T</span>
<span class="sd"> is orthogonal and P x = ||x|| e_1</span>
<span class="sd"> Parameters</span>
<span class="sd"> ----------</span>
<span class="sd"> x : array, shape (n,) or (n, 1)</span>
<span class="sd"> Returns</span>
<span class="sd"> -------</span>
<span class="sd"> beta : scalar</span>
<span class="sd"> v : array, shape (n, 1)</span>
<span class="sd"> """</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">if</span> <span class="n">x</span><span class="o">.</span><span class="n">ndim</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">]</span>
<span class="n">sigma</span> <span class="o">=</span> <span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">:,</span> <span class="mi">0</span><span class="p">])</span> <span class="o">**</span> <span class="mi">2</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">:]))</span>
<span class="k">if</span> <span class="n">sigma</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">beta</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">mu</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">**</span> <span class="mi">2</span> <span class="o">+</span> <span class="n">sigma</span><span class="p">)</span>
<span class="k">if</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o"><=</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">v</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">mu</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">v</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span> <span class="n">sigma</span> <span class="o">/</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">mu</span><span class="p">)</span>
<span class="n">beta</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">sigma</span> <span class="o">+</span> <span class="n">v</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">v</span> <span class="o">/=</span> <span class="n">v</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
<span class="k">return</span> <span class="n">beta</span><span class="p">,</span> <span class="n">v</span>
</code></pre></div>
<p>As promised, this computes the parameters of $P$ such that $P x = \Vert x \Vert e_1$,
accurate to 15 decimal places:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="n">n</span> <span class="o">=</span> <span class="mi">5</span>
<span class="o">>>></span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">beta</span><span class="p">,</span> <span class="n">v</span> <span class="o">=</span> <span class="n">house</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">P</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">eye</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="o">-</span> <span class="n">beta</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">v</span><span class="o">.</span><span class="n">T</span><span class="p">)</span>
<span class="o">>>></span> <span class="nb">print</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="n">P</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">/</span> <span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">decimals</span><span class="o">=</span><span class="mi">15</span><span class="p">))</span>
<span class="p">[</span> <span class="mf">1.</span> <span class="o">-</span><span class="mf">0.</span> <span class="o">-</span><span class="mf">0.</span> <span class="mf">0.</span> <span class="o">-</span><span class="mf">0.</span><span class="p">]</span>
</code></pre></div>
<p>This property is what makes Householder matrices useful in
the context of numerical analysis. It can be used, for example, to
compute the QR decomposition of a given matrix. The idea is to
successively zero out the sub-diagonal elements, leaving a
triangular matrix at the end. In the first iteration we compute a
Householder matrix $P_0$ such that $P_0 X$ has only zeros below the
diagonal in the first column, then a Householder matrix $P_1$
such that $P_1 P_0 X$ also zeroes out the subdiagonal elements of the second
column, and so on. At the end, $P_{n-1} \cdots P_1 P_0 X$ is
an upper triangular matrix $R$. Since each $P_i$ is orthogonal (and symmetric), the
product $Q = P_0 P_1 \cdots P_{n-1}$ is again an orthogonal matrix, namely the
$Q$ matrix in the QR decomposition $X = QR$.</p>
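The procedure above can be sketched in a few lines of NumPy. This is a hedged sketch, not the notebook's code: the helper `house_reflector` reproduces the effect of `house` via the sign-safe formula, so it maps $x$ to $\pm\Vert x\Vert e_1$ (the sign is chosen to avoid cancellation):

```python
import numpy as np

def house_reflector(x):
    """Return beta, v with (I - beta v v^T) x = -sign(x[0]) * ||x|| * e_1."""
    v = np.asarray(x, dtype=float).copy()
    normx = np.linalg.norm(v)
    if normx == 0:
        return 0.0, v
    # add sign(x[0]) * ||x|| to v[0] to avoid cancellation
    v[0] += np.copysign(normx, v[0])
    return 2.0 / np.dot(v, v), v

def householder_qr(X):
    """QR decomposition by successively zeroing the sub-diagonal columns."""
    m, n = X.shape
    R = np.array(X, dtype=float)
    Q = np.eye(m)
    for k in range(n):
        beta, v = house_reflector(R[k:, k])
        # apply P_k = I - beta v v^T to the trailing block of R ...
        R[k:, k:] -= beta * np.outer(v, v @ R[k:, k:])
        # ... and accumulate Q = P_0 P_1 ... P_{n-1} (each P_k is symmetric)
        Q[:, k:] -= beta * np.outer(Q[:, k:] @ v, v)
    return Q, R
```

On a random square matrix this recovers $X = QR$ with $R$ upper triangular up to rounding error; the diagonal of $R$ may carry either sign.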
<p>If we choose $X$ as a 20-by-20 random matrix, with colors representing
different values,</p>
<p><img alt="QR decomposition" src="/blog/static/images/2013/house_random.png"></p>
<p>we can see the Householder matrices being applied one
by one until an upper triangular matrix is obtained</p>
<p><img alt="QR decomposition" src="/blog/static/images/2013/house.gif"></p>
<p>A similar application of Householder matrices is to reduce a given
symmetric matrix to tridiagonal form. This proceeds much as in the QR
algorithm, except that now we multiply the matrix $X$ from
the left <em>and right</em> by the Householder matrices. Also, in this case
we seek Householder matrices that zero out the elements below the
first subdiagonal, instead of all subdiagonal elements. This
algorithm is used, for example, as a preprocessing step for <a href="http://www.netlib.org/lapack/lug/node70.html">most dense
eigensolvers</a>.</p>
<p><img alt="Tridiagonalization" src="/blog/static/images/2013/house_tridiag.gif"></p>
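The two-sided reduction can be sketched similarly. This is my own minimal NumPy version (not the notebook's code): each reflector zeroes one column below the first subdiagonal and is applied from both sides, so the eigenvalues are preserved:

```python
import numpy as np

def tridiagonalize(A):
    """Reduce a symmetric matrix to tridiagonal form by two-sided
    Householder similarity transforms."""
    T = np.array(A, dtype=float)
    n = T.shape[0]
    for k in range(n - 2):
        # reflector that zeroes T[k+2:, k], keeping the subdiagonal entry
        v = T[k+1:, k].copy()
        normv = np.linalg.norm(v)
        if normv == 0:
            continue
        v[0] += np.copysign(normv, v[0])
        beta = 2.0 / np.dot(v, v)
        # multiply from the left and the right by the same reflector
        T[k+1:, k:] -= beta * np.outer(v, v @ T[k+1:, k:])
        T[k:, k+1:] -= beta * np.outer(T[k:, k+1:] @ v, v)
    return T
```

Because each step is a similarity transform by an orthogonal matrix, the result has the same spectrum as the input, with all entries outside the three central diagonals (numerically) zero.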
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>"Matrix Computations" third edition, Golub & Van Loan (Algorithm 5.1.1). <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Code to reproduce the figures <a href="http://nbviewer.ipython.org/url/fa.bianp.net/blog/static/uploads/2013/householder.ipynb">can be found here</a>, source for the IPython notebook <a href="http://fa.bianp.net/blog/static/uploads/2013/householder.ipynb">can be found here</a> <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>Loss Functions for Ordinal regression2013-02-27T00:00:00+01:002013-02-27T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2013-02-27:/blog/2013/loss-functions-for-ordinal-regression/<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>** Note: this post contains a fair amount of LaTeX; if the math
doesn't render correctly, visit its <a href="http://fa.bianp.net/blog/2013/loss-functions-for-ordinal-regression/">original location</a> **</p>
<p>In …</p><script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>** Note: this post contains a fair amount of LaTeX; if the math
doesn't render correctly, visit its <a href="http://fa.bianp.net/blog/2013/loss-functions-for-ordinal-regression/">original location</a> **</p>
<p>In machine learning it is common to formulate the classification task
as a minimization problem over a given loss function. Given input
data $(x_1, ..., x_n)$ and associated labels $(y_1, ..., y_n), y_i \in
\lbrace-1, 1\rbrace$, the problem is to find a function $f(x)$
that minimizes</p>
<p>$$L(x, y) = \sum_i ^n \text{loss}(f(x_i), y_i)$$</p>
<p>where loss is some loss function. These are usually functions that
are close to zero when $f(x_i)$ agrees in sign with $y_i$ and take
a positive value when $f(x_i)$ and $y_i$ have opposite signs. Common choices
of loss functions are:</p>
<ul>
<li>Zero-one loss, $I(\text{sign}(f(x_i)) \neq y_i)$, where $I$ is the indicator function.</li>
<li>Hinge loss, $\text{max}(0, 1 - f(x_i) y_i)$</li>
<li>Logistic loss, $\log(1 + \exp(-f(x_i) y_i))$</li>
</ul>
<p><img alt="Loss functions" src="/blog/static/images/2013/loss_functions.png"></p>
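In code, writing the margin as $m = f(x_i)\, y_i$, these losses can be expressed as follows (a minimal NumPy sketch; the function names are mine, not from the original figure code):

```python
import numpy as np

def zero_one_loss(margin):
    # 1 when prediction and label disagree in sign, 0 otherwise
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    # log(1 + exp(-m)), using log1p for numerical stability
    return np.log1p(np.exp(-margin))

# margin = f(x_i) * y_i for a grid of predictions; all losses
# shrink (or vanish) as the margin grows
margins = np.linspace(-2, 2, 101)
```

All three are functions of the margin alone, which is what makes the plot above possible with a single horizontal axis.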
<p><sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup></p>
<p>In the paper <a href="http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.9242">Loss functions for preference levels: Regression with
discrete ordered
labels</a>,
the above formulation, commonly used for classification and
regression, is extended to the ordinal regression problem. In
ordinal regression, classes can take one of several discrete, but
ordered, labels. Think, for example, of movie ratings that go from zero
to ten stars. Here there's an inherent order, in the sense that
(unlike in the multiclass classification setting) not all errors are
equally bad. For instance, it is worse to mistake a 1-star movie for a
10-star one than a 4-star movie for a 5-star one.</p>
<p>To extend the binary loss to the case of ordinal regression, the
author introduces $K-1$ finite thresholds $\theta_1 \lt \theta_2 \lt
... \lt \theta_{K-1}$, together with $\theta_0 = -\infty$ and
$\theta_K = \infty$. Each of the resulting K segments corresponds to one of
the K labels: a predicted value between $\theta_{d}$ and $\theta_{d+1}$
corresponds to the prediction of class $d$ (supposing that classes go
from zero to $K-1$). This generalizes the binary case, which uses
zero as its unique threshold.</p>
<p>Suppose $K=7$ and that the correct label $y_i$ is 2, that is,
$f(x_i)$ must lie between $\theta_2$ and $\theta_3$. In that case, it
must hold that $f(x_i) > \theta_1$ and $f(x_i) > \theta_2$, as
well as $f(x_i) < \theta_3$, $\theta_4$, $\theta_5$ and $\theta_6$. Treating this as
$K-1=6$ independent classification problems produces the following
family of loss functions (for a hinge loss):</p>
<p><img alt="Ordinal loss functions" src="/blog/static/images/2013/loss_functions_ordinal.png"></p>
<p>The idea behind Rennie's paper is to sum all these loss functions to
produce a single optimization problem that penalizes the sum of all
threshold violations. Not only does the loss increase when
thresholds are violated, but its slope also increases
each time an additional threshold is crossed.</p>
<p><img alt="Ordinal loss functions" src="/blog/static/images/2013/loss_functions_ordinal2.png"></p>
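The summed "all-thresholds" loss can be sketched like this. This is my own minimal implementation of the idea for a single sample, with labels in {0, ..., K-1} and `theta` holding the K-1 finite thresholds (the function name and argument layout are mine):

```python
def all_thresholds_hinge(f_x, y, theta):
    """Sum of hinge losses over the K-1 thresholds for one sample.

    f_x:   predicted score for the sample
    y:     true label in {0, ..., K-1}
    theta: increasing sequence of the K-1 finite thresholds
    """
    loss = 0.0
    for j, t in enumerate(theta):
        # the score should fall above this threshold exactly when y > j
        z = 1.0 if j < y else -1.0
        loss += max(0.0, 1.0 - z * (f_x - t))
    return loss
```

Because every violated threshold contributes its own hinge term, a prediction that crosses several thresholds accumulates several slopes, which is exactly the piecewise-linear shape in the figure above.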
<!--
The full optimization problem can then be written as a minimization of
$$
L(x, y) = \sum_{i}^n \sum_{j}^{K-1} \text{loss}(f(x_i) - \theta_j, z_{ij} ))
$$
where $z_{ij} = -1 \text{ if } j \lt y_i$ and 1 otherwise.
-->
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>Code for generating all figures can be found <a href="http://nbviewer.ipython.org/url/fa.bianp.net/uploads/2013/ordinal_regression_loss_functions.ipynb">here</a> <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Memory plots with memory_profiler2013-01-04T00:00:00+01:002013-01-04T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2013-01-04:/blog/2013/memory-plots-with-memory_profiler/<p>Besides performing a line-by-line analysis of memory consumption,
<a href="http://pypi.python.org/pypi/memory_profiler"><code>memory_profiler</code></a>
exposes some functions that allow to retrieve the memory consumption
of a function in real-time, allowing e.g. to visualize the memory
consumption of a given function over time.</p>
<p>The function to be used is <code>memory_usage</code>. The first argument
specifies what …</p><p>Besides performing a line-by-line analysis of memory consumption,
<a href="http://pypi.python.org/pypi/memory_profiler"><code>memory_profiler</code></a>
exposes some functions that allow to retrieve the memory consumption
of a function in real-time, allowing e.g. to visualize the memory
consumption of a given function over time.</p>
<p>The function to be used is <code>memory_usage</code>. The first argument
specifies what code is to be monitored. This can represent either an
external process or a Python function. In the case of an external
process the first argument is an integer representing its process
identifier (PID). In the case of a Python function, we need to pass the
function and its arguments to <code>memory_usage</code>. We do this by passing the
tuple <code>(f, args, kw)</code> that specifies the function, its positional
arguments as a tuple and its keyword arguments as a dictionary,
respectively. This will then be executed by <code>memory_usage</code> as
<code>f(*args, **kw)</code>.</p>
<p>Let's see this with an example. Take as function NumPy's
<a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.pinv.html">pseudo-inverse function</a>. Thus
<code>f = numpy.linalg.pinv</code> and <code>f</code> takes one positional argument (the
matrix to be inverted) so <code>args = (a,)</code> where <code>a</code> is the matrix to be
inverted. Note that args must be a tuple consisting of the different
arguments, thus the parenthesis around <code>a</code>. The third item is a
dictionary <code>kw</code> specifying the keyword arguments. Here kw is optional
and is omitted.</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">memory_profiler</span> <span class="kn">import</span> <span class="n">memory_usage</span>
<span class="o">>>></span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="c1"># create a random matrix</span>
<span class="o">>>></span> <span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">500</span><span class="p">,</span> <span class="mi">500</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">mem_usage</span> <span class="o">=</span> <span class="n">memory_usage</span><span class="p">((</span><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">pinv</span><span class="p">,</span> <span class="p">(</span><span class="n">a</span><span class="p">,)),</span> <span class="n">interval</span><span class="o">=</span><span class="mf">.01</span><span class="p">)</span>
<span class="o">>>></span> <span class="nb">print</span><span class="p">(</span><span class="n">mem_usage</span><span class="p">)</span>
<span class="p">[</span><span class="mf">57.02734375</span><span class="p">,</span> <span class="mf">55.0234375</span><span class="p">,</span> <span class="mf">57.078125</span><span class="p">,</span> <span class="o">...</span><span class="p">]</span>
</code></pre></div>
<p>This gives me a list of memory measurements taken at regular time
intervals <code>(t0, t0 + .01, t0 + .02, ...)</code>. Now I can
use it, for example, to plot the memory consumption as a function of
time:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">pylab</span> <span class="k">as</span> <span class="nn">pl</span>
<span class="o">>>></span> <span class="n">pl</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">mem_usage</span><span class="p">))</span> <span class="o">*</span> <span class="mf">.01</span><span class="p">,</span> <span class="n">mem_usage</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'linalg.pinv'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">pl</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Time (in seconds)'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">pl</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'Memory consumption (in MB)'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">pl</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p><img alt="Memory plot" src="/blog/static/images/2013/plot_memory_pinv.png"></p>
<p>This gives the memory usage of a single function across time, which
can be useful, for example, to detect temporaries created during the
execution. </p>
<p>Another use case for <code>memory_usage</code> would be to see how memory behaves
as input data gets bigger. In this case we are interested in memory as
a function of the input data. One obvious way to do this is to call
the same function with inputs of different sizes and take as
memory consumption the maximum consumption over time. This way we
obtain one memory usage figure per input.</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">):</span>
<span class="o">...</span> <span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">100</span> <span class="o">*</span> <span class="n">i</span><span class="p">,</span> <span class="mi">100</span> <span class="o">*</span> <span class="n">i</span><span class="p">)</span>
<span class="o">...</span> <span class="n">mem_usage</span> <span class="o">=</span> <span class="n">memory_usage</span><span class="p">((</span><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">pinv</span><span class="p">,</span> <span class="p">(</span><span class="n">A</span><span class="p">,)))</span>
<span class="o">...</span> <span class="nb">print</span><span class="p">(</span><span class="nb">max</span><span class="p">(</span><span class="n">mem_usage</span><span class="p">))</span>
<span class="mf">29.22</span>
<span class="mf">30.10</span>
<span class="mf">40.66</span>
<span class="mf">53.96</span>
</code></pre></div>
<p>It is now possible to plot these results as a function of the
dimensions.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pylab</span> <span class="k">as</span> <span class="nn">pl</span>
<span class="kn">from</span> <span class="nn">memory_profiler</span> <span class="kn">import</span> <span class="n">memory_usage</span>
<span class="n">dims</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="mi">1000</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">pinv_mem</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">dims</span><span class="o">.</span><span class="n">size</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i_dim</span><span class="p">,</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">dims</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
<span class="n">tmp</span> <span class="o">=</span> <span class="n">memory_usage</span><span class="p">((</span><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">pinv</span><span class="p">,</span> <span class="p">(</span><span class="n">x</span><span class="p">,)),</span> <span class="n">interval</span><span class="o">=</span><span class="mf">.01</span><span class="p">)</span>
<span class="n">pinv_mem</span><span class="p">[</span><span class="n">i_dim</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="n">tmp</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">dims</span><span class="p">,</span> <span class="n">pinv_mem</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'np.linalg.pinv'</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'Memory (in MB)'</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Dimension of the square matrix'</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s1">'upper left'</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s1">'tight'</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p><img alt="Memory plot" src="/blog/static/images/2013/plot_memory_pinv_2.png"></p>Singular Value Decomposition in SciPy2012-12-08T00:00:00+01:002012-12-08T00:00:00+01:00Fabian Pedregosatag:fa.bianp.net,2012-12-08:/blog/2012/singular-value-decomposition-in-scipy/<p>SciPy contains two methods to compute the singular value decomposition (SVD) of a matrix: <code>scipy.linalg.svd</code> and <code>scipy.sparse.linalg.svds</code>. In this post I'll compare both methods for the task of computing the full SVD of a large dense matrix.</p>
<p>The first method, <code>scipy.linalg.svd</code>, is perhaps …</p><p>SciPy contains two methods to compute the singular value decomposition (SVD) of a matrix: <code>scipy.linalg.svd</code> and <code>scipy.sparse.linalg.svds</code>. In this post I'll compare both methods for the task of computing the full SVD of a large dense matrix.</p>
<p>The first method, <code>scipy.linalg.svd</code>, is perhaps the best known and uses the linear algebra library <a href="http://www.netlib.org/lapack/">LAPACK</a> to handle the computations. This implements the Golub-Kahan-Reisch algorithm <sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>, which is accurate and highly efficient with a cost of O(n^3) floating-point operations <sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup>.</p>
<p>The second method is <code>scipy.sparse.linalg.svds</code> and despite its name it also works fine for dense arrays. This implementation is based on the <a href="http://www.caam.rice.edu/software/ARPACK/">ARPACK</a> library and consists of an iterative procedure that finds the SVD decomposition by reducing the problem to an eigendecomposition of dot(X.T, X). This method is usually very effective when the input matrix X is sparse or when only the largest singular values are required. There are other SVD solvers that I did not consider, such as <a href="http://pypi.python.org/pypi/sparsesvd/">sparsesvd</a> or <a href="http://pysparse.sourceforge.net/introduction.html#jdsym">pysparse.jdsym</a>, but my points for the sparse solver probably hold for those packages too, since they both implement iterative algorithms based on the same principles.</p>
<p>When the input matrix is dense and all the singular values are required, the first method is usually more efficient. To support this statement I've created a little benchmark: timings for both methods as a function of the size of the matrices. Notice that we are in a case that is clearly favorable to <code>linalg.svd</code>: after all, <code>sparse.linalg.svds</code> was not created with this setting in mind; it was created for sparse matrices or dense matrices with some special structure. We will see, however, that even in this setting it has interesting advantages.</p>
<p>I'll create random square matrices with different sizes and plot the timings for both methods. For the benchmarks I used SciPy v0.12 linked against Intel Math Kernel Library v11. Both methods are single-threaded (I had to set OMP_NUM_THREADS=1 so that MKL does not try to parallelize the computations). <a href="https://gist.github.com/4250756#file-svd_timing-py">[code]</a></p>
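The comparison can be sketched like this. This is my own minimal version of such a benchmark, not the linked gist; note that with ARPACK one can request at most n-1 singular values of an n-by-n matrix:

```python
import time
import numpy as np
from scipy import linalg
from scipy.sparse.linalg import svds

def time_svd_solvers(n, seed=0):
    """Time the LAPACK and ARPACK solvers on an n-by-n dense matrix."""
    X = np.random.RandomState(seed).randn(n, n)
    t0 = time.time()
    linalg.svd(X)               # LAPACK: full SVD
    t_lapack = time.time() - t0
    t0 = time.time()
    svds(X, k=n - 1)            # ARPACK: the n-1 largest singular values
    t_arpack = time.time() - t0
    return t_lapack, t_arpack
```

Both solvers agree on the singular values they compute, so the difference is purely one of time and (as shown below) memory.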
<p><img alt="svd timings" src="http://fa.bianp.net/blog/static/uploads/2012/svd_timing.png"></p>
<p>Lower timings are better, so this gives <code>scipy.linalg.svd</code> as clear winner. However, this is just part of the story. What this graph doesn't show is that this method is winning at the price of allocating a huge amount of memory for temporary computations. If we now plot the memory consumption for both methods under the same settings, the story is completely different. <a href="https://gist.github.com/4250756#file-svd_memory-py">[code]</a></p>
<p><img alt="svd memory" src="http://fa.bianp.net/blog/static/uploads/2012/svd_memory.png"></p>
<p>The memory requirements of <code>scipy.linalg.svd</code> scale with the number of dimensions, while for the sparse version the amount of allocated memory is constant. Notice that we are measuring the amount of total memory used, it is thus natural to see a slight increase in memory consumption since the input matrix is bigger on each iteration.</p>
<p>For example, in my applications, I need to compute the SVD of a matrix whose needed workspace does not fit in memory. In cases like this, the sparse algorithm (<code>sparse.linalg.svds</code>) can come in handy: the timing is just a constant factor worse (and I can easily parallelize jobs), while the memory requirements of this method are peanuts compared to the dense version.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>Calculating the singular values and pseudo-inverse of a matrix, <em>Golub, Gene H., Kahan, William</em>, 1965, <a href="http://www.jstor.org/stable/2949777">JSTOR</a> <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>A Survey of Singular Value Decomposition Methods and Performance Comparison of Some Available Serial Codes, <em>Plassman, Gerald E.</em> 2005 <a href="http://research.microsoft.com/en-us/um/people/zhoulin/2005-a%20survey%20of%20singular%20value%20decomposition_plassman.pdf">PDF</a> <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>Learning to rank with scikit-learn: the pairwise transform2012-10-23T00:00:00+02:002012-10-23T00:00:00+02:00Fabian Pedregosatag:fa.bianp.net,2012-10-23:/blog/2012/learning-to-rank-with-scikit-learn-the-pairwise-transform/<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>This tutorial introduces the concept of pairwise preference used in most <a href="http://en.wikipedia.org/wiki/Learning_to_rank">ranking problems</a>. I'll use scikit-learn for learning and matplotlib for …</p><script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"]
},
"HTML-CSS": { fonts: ["TeX"] }
});
</script>
<script type="text/javascript" async
src="/node_modules/mathjax2/MathJax.js">
</script>
<p>This tutorial introduces the concept of pairwise preference used in most <a href="http://en.wikipedia.org/wiki/Learning_to_rank">ranking problems</a>. I'll use scikit-learn for learning and matplotlib for visualization.</p>
<p>In the ranking setting, training data consists of lists of items with some order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item, so that for any two samples <code>a</code> and <code>b</code>, either <code>a < b</code>, <code>a > b</code>, or <code>a</code> and <code>b</code> are not comparable.</p>
<p>For example, in the case of a search engine, our dataset consists of results that belong to different queries and we would like to only compare the relevance for results coming from the same query.</p>
<p>This order relation is usually domain-specific. For instance, in information retrieval the set of comparable samples is referred to as a "query id". The goal behind this is to compare only documents that belong to the same query (<a href="http://dx.doi.org/10.1145/775047.775067">Joachims 2002</a>). In medical imaging on the other hand, the order of the labels usually depend on the subject so the comparable samples is given by the different subjects in the study (<a href="http://hal.inria.fr/hal-00717990/en">Pedregosa et al 2012</a>).</p>
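The pairwise idea described above can be sketched as follows. This is my own illustrative helper, not the tutorial's code: for every comparable (same-query, untied) pair it forms the difference of the feature vectors and labels it with the sign of the label difference, turning ranking into binary classification:

```python
import itertools
import numpy as np

def pairwise_transform(X, y, query_id):
    """One difference vector per comparable pair of samples.

    Samples are comparable only when they share a query id and their
    labels differ; the binary label is the sign of the label difference.
    """
    X_pairs, y_pairs = [], []
    for i, j in itertools.combinations(range(len(y)), 2):
        if query_id[i] != query_id[j] or y[i] == y[j]:
            continue  # incomparable: different query, or tied labels
        X_pairs.append(X[i] - X[j])
        y_pairs.append(np.sign(y[i] - y[j]))
    return np.asarray(X_pairs), np.asarray(y_pairs)
```

Any linear binary classifier trained on these pairs then yields a scoring function whose sign on `X[i] - X[j]` predicts the relative order of samples `i` and `j`.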
<div class="codehilite"><pre><span class="kn">import</span> <span class="nn">itertools</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">stats</span>
<span class="kn">import</span> <span class="nn">pylab</span> <span class="kn">as</span> <span class="nn">pl</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">svm</span><span class="p">,</span> <span class="n">linear_model</span><span class="p">,</span> <span class="n">cross_validation</span>
</pre></div>
<p>To start with, we'll create a dataset in which the target values consists of three graded measurements Y = {0, 1, 2} and the input data is a collection of 30 samples, each one with two features.</p>
<p>The set of comparable elements (queries in information retrieval) will consist of two equally sized blocks, $X = X_1 \cup X_2$, where each block is generated using a normal distribution with different mean and covariance. In the pictures, we represent $X_1$ with round markers and $X_2$ with triangular markers.</p>
<div class="codehilite"><pre><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">theta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">deg2rad</span><span class="p">(</span><span class="mi">60</span><span class="p">)</span>
<span class="n">w</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">theta</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">theta</span><span class="p">)])</span>
<span class="n">K</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">K</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">K</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">):</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">((</span><span class="n">X</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">K</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">w</span><span class="p">))</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">((</span><span class="n">y</span><span class="p">,</span> <span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">K</span><span class="p">))</span>
<span class="c"># slightly displace data corresponding to our second partition</span>
<span class="n">X</span><span class="p">[::</span><span class="mi">2</span><span class="p">]</span> <span class="o">-=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">3</span><span class="p">,</span> <span class="mi">7</span><span class="p">])</span>
<span class="n">blocks</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">//</span> <span class="mi">2</span><span class="p">))</span>
<span class="c"># split into train and test set</span>
<span class="n">cv</span> <span class="o">=</span> <span class="n">cross_validation</span><span class="o">.</span><span class="n">StratifiedShuffleSplit</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=.</span><span class="mi">5</span><span class="p">)</span>
<span class="n">train</span><span class="p">,</span> <span class="n">test</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="nb">iter</span><span class="p">(</span><span class="n">cv</span><span class="p">))</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">b_train</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">train</span><span class="p">],</span> <span class="n">y</span><span class="p">[</span><span class="n">train</span><span class="p">],</span> <span class="n">blocks</span><span class="p">[</span><span class="n">train</span><span class="p">]</span>
<span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="n">b_test</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">test</span><span class="p">],</span> <span class="n">y</span><span class="p">[</span><span class="n">test</span><span class="p">],</span> <span class="n">blocks</span><span class="p">[</span><span class="n">test</span><span class="p">]</span>
<span class="c"># plot the result</span>
<span class="n">idx</span> <span class="o">=</span> <span class="p">(</span><span class="n">b_train</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X_train</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">y_train</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span>
<span class="n">marker</span><span class="o">=</span><span class="s">'^'</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">cm</span><span class="o">.</span><span class="n">Blues</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X_train</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">y_train</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">],</span>
<span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">cm</span><span class="o">.</span><span class="n">Blues</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">arrow</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">8</span> <span class="o">*</span> <span class="n">w</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">8</span> <span class="o">*</span> <span class="n">w</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">fc</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span> <span class="n">ec</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span>
<span class="n">head_width</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">head_length</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'$w$'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">arrow</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="o">-</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span> <span class="o">*</span> <span class="n">w</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">8</span> <span class="o">*</span> <span class="n">w</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">fc</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span> <span class="n">ec</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span>
<span class="n">head_width</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">head_length</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="o">-</span><span class="mf">2.6</span><span class="p">,</span> <span class="o">-</span><span class="mi">7</span><span class="p">,</span> <span class="s">'$w$'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'equal'</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
<p><img alt="" src="http://fa.bianp.net/blog/static/uploads/2012/pairwise_transform_files/pairwise_transform_fig_00.png"></p>
<p>In the plot we clearly see that for both blocks there is a common vector $w$ such that projecting onto $w$ gives a list with the correct ordering.</p>
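<p>This claim is easy to check numerically: ranking a block by its projection onto $w$ should reproduce the graded labels. A minimal sketch, regenerating one block with the same recipe as above (the score used for the check, Kendall tau, is introduced more formally below):</p>

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
theta = np.deg2rad(60)
w = np.array([np.sin(theta), np.cos(theta)])

# one block: three grades, each shifted along w, as in the dataset above
X = np.concatenate([rng.randn(20, 2) + i * 4 * w for i in range(3)])
y = np.repeat([0, 1, 2], 20)

# ranking by the projection onto w should agree with the grades
tau, _ = stats.kendalltau(X.dot(w), y)
print(tau)
```

<p>The correlation is not exactly 1 because within-grade pairs are ties in the labels, but it is close to the maximum attainable value.</p>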
<p>However, because linear regression assumes that the output labels live in a metric space, it considers all pairs comparable. Thus, if we fit this model to the problem above, it fits both blocks at the same time, yielding a result that is clearly not optimal. In the following plot we estimate $\hat{w}$ using an l2-regularized linear model.</p>
<div class="codehilite"><pre><span class="n">ridge</span> <span class="o">=</span> <span class="n">linear_model</span><span class="o">.</span><span class="n">Ridge</span><span class="p">(</span><span class="mf">1.</span><span class="p">)</span>
<span class="n">ridge</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">coef</span> <span class="o">=</span> <span class="n">ridge</span><span class="o">.</span><span class="n">coef_</span> <span class="o">/</span> <span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">ridge</span><span class="o">.</span><span class="n">coef_</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X_train</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">y_train</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span>
<span class="n">marker</span><span class="o">=</span><span class="s">'^'</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">cm</span><span class="o">.</span><span class="n">Blues</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X_train</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">y_train</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">],</span>
<span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">cm</span><span class="o">.</span><span class="n">Blues</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">arrow</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">7</span> <span class="o">*</span> <span class="n">coef</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">7</span> <span class="o">*</span> <span class="n">coef</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">fc</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span> <span class="n">ec</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span>
<span class="n">head_width</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">head_length</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s">'$\hat{w}$'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'equal'</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Estimation by Ridge regression'</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
<p><img alt="" src="http://fa.bianp.net/blog/static/uploads/2012/pairwise_transform_files/pairwise_transform_fig_01.png"></p>
<p>To assess the quality of our model we need to define a ranking score. Since we are interested in a model that <em>orders</em> the data, it is natural to look for a metric that compares the ordering produced by our model to the given ordering. For this, we use <a href="http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient">Kendall's tau correlation coefficient</a>, defined as (P - Q)/(P + Q), where P is the number of concordant pairs and Q the number of discordant pairs. This measure is used extensively in the ranking literature (e.g. <a href="http://www.cs.cornell.edu/people/tj/publications/joachims_02c.pdf">Optimizing Search Engines using Clickthrough Data</a>).</p>
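<p>The definition can be verified by counting pairs directly. A small sketch (O(n²), fine for illustration; the helper name <code>kendall_tau</code> is introduced here), checked against <code>scipy.stats.kendalltau</code> on tie-free data, where the two definitions coincide:</p>

```python
import itertools

import numpy as np
from scipy import stats


def kendall_tau(a, b):
    """(P - Q) / (P + Q): concordant minus discordant pairs, normalized."""
    P = Q = 0
    for i, j in itertools.combinations(range(len(a)), 2):
        s = np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
        if s > 0:
            P += 1   # pair ordered the same way in a and b
        elif s < 0:
            Q += 1   # pair ordered differently
    return (P - Q) / float(P + Q)


rng = np.random.RandomState(0)
a, b = rng.rand(50), rng.rand(50)   # continuous data: no ties
tau, _ = stats.kendalltau(a, b)
print(np.allclose(kendall_tau(a, b), tau))
```

<p>Note that in the presence of ties scipy's implementation (tau-b) uses a different normalization, so the two only agree when there are no ties.</p>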
<p>We thus evaluate this metric on the test set for each block separately.</p>
<div class="codehilite"><pre><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">):</span>
<span class="n">tau</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">stats</span><span class="o">.</span><span class="n">kendalltau</span><span class="p">(</span>
<span class="n">ridge</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">b_test</span> <span class="o">==</span> <span class="n">i</span><span class="p">]),</span> <span class="n">y_test</span><span class="p">[</span><span class="n">b_test</span> <span class="o">==</span> <span class="n">i</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Kendall correlation coefficient for block </span><span class="si">%s</span><span class="s">: </span><span class="si">%.5f</span><span class="s">'</span> <span class="o">%</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">tau</span><span class="p">))</span>
</pre></div>
<div class="highlight"><pre><span></span><code><span class="nv">Kendall</span><span class="w"> </span><span class="nv">correlation</span><span class="w"> </span><span class="nv">coefficient</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nv">block</span><span class="w"> </span><span class="mi">0</span>:<span class="w"> </span><span class="mi">0</span>.<span class="mi">71122</span>
<span class="nv">Kendall</span><span class="w"> </span><span class="nv">correlation</span><span class="w"> </span><span class="nv">coefficient</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nv">block</span><span class="w"> </span><span class="mi">1</span>:<span class="w"> </span><span class="mi">0</span>.<span class="mi">84387</span>
</code></pre></div>
<h2>The pairwise transform</h2>
<p>As proved in (<a href="http://www.mendeley.com/research/support-vector-learning-ordinal-regression/">Herbrich 1999</a>), if we consider linear ranking functions, the ranking problem can be transformed into a two-class classification problem. For this, we form the difference of all comparable elements such that our data is transformed into $(x'_k, y'_k) = (x_i - x_j, sign(y_i - y_j))$ for all comparable pairs.</p>
<p>This transforms our ranking problem into a two-class classification problem. The following plot shows the transformed dataset: color reflects the difference in labels, and the task is to separate positive samples from negative ones. The hyperplane $\{x^T w = 0\}$ separates these two classes.</p>
<div class="codehilite"><pre><span class="c"># form all pairwise combinations</span>
<span class="n">comb</span> <span class="o">=</span> <span class="n">itertools</span><span class="o">.</span><span class="n">combinations</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">X_train</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">k</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">Xp</span><span class="p">,</span> <span class="n">yp</span><span class="p">,</span> <span class="n">diff</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[],</span> <span class="p">[]</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="ow">in</span> <span class="n">comb</span><span class="p">:</span>
<span class="k">if</span> <span class="n">y_train</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">y_train</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> \
<span class="ow">or</span> <span class="n">blocks</span><span class="p">[</span><span class="n">train</span><span class="p">][</span><span class="n">i</span><span class="p">]</span> <span class="o">!=</span> <span class="n">blocks</span><span class="p">[</span><span class="n">train</span><span class="p">][</span><span class="n">j</span><span class="p">]:</span>
<span class="c"># skip if same target or different group</span>
<span class="k">continue</span>
<span class="n">Xp</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">X_train</span><span class="p">[</span><span class="n">j</span><span class="p">])</span>
<span class="n">diff</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">y_train</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">y_train</span><span class="p">[</span><span class="n">j</span><span class="p">])</span>
<span class="n">yp</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sign</span><span class="p">(</span><span class="n">diff</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]))</span>
<span class="c"># output balanced classes</span>
<span class="k">if</span> <span class="n">yp</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">**</span> <span class="n">k</span><span class="p">:</span>
<span class="n">yp</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">*=</span> <span class="o">-</span><span class="mi">1</span>
<span class="n">Xp</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">*=</span> <span class="o">-</span><span class="mi">1</span>
<span class="n">diff</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">*=</span> <span class="o">-</span><span class="mi">1</span>
<span class="n">k</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">Xp</span><span class="p">,</span> <span class="n">yp</span><span class="p">,</span> <span class="n">diff</span> <span class="o">=</span> <span class="nb">map</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">asanyarray</span><span class="p">,</span> <span class="p">(</span><span class="n">Xp</span><span class="p">,</span> <span class="n">yp</span><span class="p">,</span> <span class="n">diff</span><span class="p">))</span>
<span class="n">pl</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">Xp</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">Xp</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">diff</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">60</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">cm</span><span class="o">.</span><span class="n">Blues</span><span class="p">)</span>
<span class="n">x_space</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_space</span> <span class="o">*</span> <span class="n">w</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="o">-</span> <span class="n">x_space</span> <span class="o">*</span> <span class="n">w</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s">'gray'</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="s">'$\{x^T w = 0\}$'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'equal'</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
<p><img alt="" src="http://fa.bianp.net/blog/static/uploads/2012/pairwise_transform_files/pairwise_transform_fig_02.png"></p>
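<p>The pair-generation loop above can be packaged as a standalone helper. A sketch (the name <code>pairwise_transform</code> is introduced here for illustration; it is not part of scikit-learn):</p>

```python
import itertools

import numpy as np


def pairwise_transform(X, y, blocks):
    """Form (x_i - x_j, sign(y_i - y_j)) for all comparable pairs, i.e.
    pairs in the same block with different targets. Pairs are sign-flipped
    alternately so that the two output classes stay balanced."""
    Xp, yp = [], []
    for i, j in itertools.combinations(range(X.shape[0]), 2):
        if y[i] == y[j] or blocks[i] != blocks[j]:
            continue  # skip ties and cross-block pairs
        diff = X[i] - X[j]
        sign = np.sign(y[i] - y[j])
        if sign != (-1) ** len(yp):  # alternate class labels
            diff, sign = -diff, -sign
        Xp.append(diff)
        yp.append(sign)
    return np.asarray(Xp), np.asarray(yp)


# tiny usage example: one block, three graded samples -> three pairs
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0, 1, 2])
Xp, yp = pairwise_transform(X, y, np.zeros(3))
print(Xp.shape)
```
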
<p>As we see in the previous plot, this classification problem is linearly separable. That will not always be the case: it holds here because our training set contains no order inversions.</p>
<p>We will now finally train a Support Vector Machine model on the transformed data.
This model is known as RankSVM, although we note that the pairwise transform is more general and can be used together with any linear model. We then plot the training data together with the coefficient $\hat{w}$ estimated by RankSVM.</p>
<div class="codehilite"><pre><span class="n">clf</span> <span class="o">=</span> <span class="n">svm</span><span class="o">.</span><span class="n">SVC</span><span class="p">(</span><span class="n">kernel</span><span class="o">=</span><span class="s">'linear'</span><span class="p">,</span> <span class="n">C</span><span class="o">=.</span><span class="mi">1</span><span class="p">)</span>
<span class="n">clf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">Xp</span><span class="p">,</span> <span class="n">yp</span><span class="p">)</span>
<span class="n">coef</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">coef_</span><span class="o">.</span><span class="n">ravel</span><span class="p">()</span> <span class="o">/</span> <span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">clf</span><span class="o">.</span><span class="n">coef_</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X_train</span><span class="p">[</span><span class="n">idx</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">y_train</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span>
<span class="n">marker</span><span class="o">=</span><span class="s">'^'</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">cm</span><span class="o">.</span><span class="n">Blues</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X_train</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">y_train</span><span class="p">[</span><span class="o">~</span><span class="n">idx</span><span class="p">],</span>
<span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">pl</span><span class="o">.</span><span class="n">cm</span><span class="o">.</span><span class="n">Blues</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">arrow</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">7</span> <span class="o">*</span> <span class="n">coef</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">7</span> <span class="o">*</span> <span class="n">coef</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">fc</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span> <span class="n">ec</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span>
<span class="n">head_width</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">head_length</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">arrow</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="o">-</span><span class="mi">8</span><span class="p">,</span> <span class="mi">7</span> <span class="o">*</span> <span class="n">coef</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">7</span> <span class="o">*</span> <span class="n">coef</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">fc</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span> <span class="n">ec</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span>
<span class="n">head_width</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">head_length</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">.</span><span class="mi">7</span><span class="p">,</span> <span class="s">'$\hat{w}$'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="o">-</span><span class="mf">2.6</span><span class="p">,</span> <span class="o">-</span><span class="mi">7</span><span class="p">,</span> <span class="s">'$\hat{w}$'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'equal'</span><span class="p">)</span>
<span class="n">pl</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
<p><img alt="" src="http://fa.bianp.net/blog/static/uploads/2012/pairwise_transform_files/pairwise_transform_fig_03.png"></p>
<p>Finally, we check that, as expected, the RankSVM model attains a higher ranking score (Kendall tau) than linear regression.</p>
<div class="codehilite"><pre><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">):</span>
<span class="n">tau</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">stats</span><span class="o">.</span><span class="n">kendalltau</span><span class="p">(</span>
<span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">b_test</span> <span class="o">==</span> <span class="n">i</span><span class="p">],</span> <span class="n">coef</span><span class="p">),</span> <span class="n">y_test</span><span class="p">[</span><span class="n">b_test</span> <span class="o">==</span> <span class="n">i</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Kendall correlation coefficient for block </span><span class="si">%s</span><span class="s">: </span><span class="si">%.5f</span><span class="s">'</span> <span class="o">%</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">tau</span><span class="p">))</span>
</pre></div>
<div class="highlight"><pre><span></span><code><span class="nv">Kendall</span><span class="w"> </span><span class="nv">correlation</span><span class="w"> </span><span class="nv">coefficient</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nv">block</span><span class="w"> </span><span class="mi">0</span>:<span class="w"> </span><span class="mi">0</span>.<span class="mi">83627</span>
<span class="nv">Kendall</span><span class="w"> </span><span class="nv">correlation</span><span class="w"> </span><span class="nv">coefficient</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nv">block</span><span class="w"> </span><span class="mi">1</span>:<span class="w"> </span><span class="mi">0</span>.<span class="mi">84387</span>
</code></pre></div>
<p>These are indeed at least as high as the values (0.71122, 0.84387) obtained in the case of linear regression.</p>
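<p>To make the evaluation above concrete, here is a small self-contained sketch (synthetic scores, not the post's dataset) showing how <tt class="docutils literal">scipy.stats.kendalltau</tt> rewards order-preserving predictions:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
y = np.arange(20, dtype=float)       # ground-truth ordering of the targets
good = y + 0.5 * rng.randn(20)       # scores that mostly preserve the order
bad = y + 20.0 * rng.randn(20)       # scores with the order largely destroyed

tau_good, _ = stats.kendalltau(good, y)
tau_bad, _ = stats.kendalltau(bad, y)
```

<p>Kendall tau lives in [-1, 1]; the closer to 1, the more consistently the predicted scores rank the targets, which is exactly what the per-block comparison above measures.</p>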
<p><strong>Original ipython notebook for this blog post can be found <a href="https://github.com/fabianp/minirank/blob/master/notebooks/pairwise_transform.ipynb">here</a></strong></p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>"Large Margin Rank Boundaries for Ordinal Regression", R. Herbrich, T. Graepel, and K. Obermayer. Advances in Large Margin Classifiers, 115-132, Liu Press, 2000 <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>"Optimizing Search Engines Using Clickthrough Data", T. Joachims. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002. <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:3">
<p>"Learning to rank from medical imaging data", Pedregosa et al. [<a href="http://arxiv.org/abs/1207.3598">arXiv</a>] <a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:4">
<p>"Efficient algorithms for ranking with SVMs", O. Chapelle and S. S. Keerthi, Information Retrieval Journal, Special Issue on Learning to Rank, 2009 <a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:5">
<p><a href="http://opensourceconnections.com/blog/2017/08/03/search-as-machine-learning-prob/">Doug Turnbull's blog post on learning to rank</a> <a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
</ol>
</div>line-by-line memory usage of a Python program2012-04-24T07:04:00+02:002012-04-24T07:04:00+02:00Fabian Pedregosatag:fa.bianp.net,2012-04-24:/blog/2012/line-by-line-report-of-memory-usage/<p>My newest project is a Python library for monitoring memory consumption
of arbitrary processes, and one of its most useful features is the
line-by-line analysis of memory usage for Python code. I wrote a basic
prototype six months ago after being surprised by the lack of related
tools. I wanted …</p><p>My newest project is a Python library for monitoring memory consumption
of arbitrary processes, and one of its most useful features is the
line-by-line analysis of memory usage for Python code. I wrote a basic
prototype six months ago after being surprised by the lack of related
tools. I wanted to <a class="reference external" href="http://fa.bianp.net/blog/2011/qr_multiply-function-in-scipy-linalg/">plot memory consumption</a> of a couple of Python
functions but did not find a python module to do the job. I came to the
conclusion that there is no standard way to get the memory usage of the
Python interpreter from within Python, so I resorted to reading from
<tt class="docutils literal"><span class="pre">/proc/$PID/statm</span></tt>. From there on I realized that once the fetching of
memory is done, making a line-by-line report wouldn't be hard. Back to
today: I've been using the line-by-line memory monitoring to diagnose
poor memory management (hidden temporaries, unused allocations, etc.) for
some time. It seems to work on two different computers, so, full of
confidence, I'll write a blog post about it ...</p>
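<p>For the curious, the <tt class="docutils literal"><span class="pre">/proc/$PID/statm</span></tt> approach boils down to very little code. A minimal sketch (not the library's actual implementation): the second column of that file is the resident set size, counted in pages.</p>

```python
def statm_resident_mb(statm_line, page_size=4096):
    """Resident set size in MiB parsed from a /proc/<pid>/statm line.

    Columns (all counted in pages) are: size resident shared text lib data dt.
    4096 is the usual page size; the real value is os.sysconf('SC_PAGE_SIZE').
    """
    resident_pages = int(statm_line.split()[1])
    return resident_pages * page_size / (1024.0 ** 2)

# On Linux, the live value for the current process would be read with:
#   import os
#   with open('/proc/%d/statm' % os.getpid()) as f:
#       print(statm_resident_mb(f.read(), os.sysconf('SC_PAGE_SIZE')))
```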
<div class="section" id="how-to-use-it">
<h2>How to use it?</h2>
<p>The easiest way to get it is to install from the Python Package Index:</p>
<div class="highlight"><pre><span></span>$<span class="w"> </span>easy_install<span class="w"> </span>-U<span class="w"> </span>memory_profiler<span class="w"> </span><span class="c1"># pip install -U memory_profiler</span>
</pre></div>
<p>but other options include fetching the latest version
from <a class="reference external" href="https://github.com/fabianp/memory_profiler">github</a>, or dropping it in your current working directory or
somewhere else on your PYTHONPATH, since it consists of a single file.
The next step is to write some Python code to profile. It can be just
about any function, but for the purpose of this blog post I'll create a
function <tt class="docutils literal"><span class="pre">my_func()</span></tt> with mostly memory allocations and save it to a file
named example.py:</p>
<div class="highlight"><pre><span></span><span class="nd">@profile</span>
<span class="k">def</span> <span class="nf">my_func</span><span class="p">():</span>
<span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="mi">10</span> <span class="o">**</span> <span class="mi">6</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="mi">10</span> <span class="o">**</span> <span class="mi">7</span><span class="p">)</span>
<span class="k">del</span> <span class="n">b</span>
<span class="k">return</span> <span class="n">a</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
<span class="n">my_func</span><span class="p">()</span>
</pre></div>
<p>Note that I've decorated the function
with @profile. This tells the profiler to look into function my_func
and gather the memory consumption for each line.</p>
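<p>One practical note (a common companion trick, not part of the profiler itself): <tt class="docutils literal">profile</tt> is only injected as a builtin when running under the profiler, so a no-op fallback lets the same script also run as plain Python:</p>

```python
# Define a pass-through `profile` decorator when the real one
# (injected by `python -m memory_profiler`) is not present.
try:
    profile
except NameError:
    def profile(func):
        return func

@profile
def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a
```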
</div>
<div class="section" id="wake-up-the-cookie-monster">
<h2>Wake up the cookie monster</h2>
<p>To start profiling and output the result to stdout, run the script as
usual and append the options <cite>-m memory_profiler -l -v</cite> to the python
interpreter:</p>
<div class="highlight"><pre><span></span>$<span class="w"> </span>python<span class="w"> </span>-m<span class="w"> </span>memory_profiler<span class="w"> </span>example.py
Filename:<span class="w"> </span>example.py
Line<span class="w"> </span><span class="c1"># Mem usage Increment Line Contents</span>
<span class="o">================================================</span>
<span class="w"> </span><span class="m">2</span><span class="w"> </span>@profile
<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="m">8</span>.00<span class="w"> </span>MB<span class="w"> </span><span class="m">0</span>.00<span class="w"> </span>MB<span class="w"> </span>def<span class="w"> </span>my_func<span class="o">()</span>:
<span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="m">15</span>.00<span class="w"> </span>MB<span class="w"> </span><span class="m">7</span>.00<span class="w"> </span>MB<span class="w"> </span><span class="nv">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">[</span><span class="m">1</span><span class="o">]</span><span class="w"> </span>*<span class="w"> </span><span class="o">(</span><span class="m">10</span><span class="w"> </span>**<span class="w"> </span><span class="m">6</span><span class="o">)</span>
<span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="m">168</span>.00<span class="w"> </span>MB<span class="w"> </span><span class="m">153</span>.00<span class="w"> </span>MB<span class="w"> </span><span class="nv">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">[</span><span class="m">2</span><span class="o">]</span><span class="w"> </span>*<span class="w"> </span><span class="o">(</span><span class="m">2</span><span class="w"> </span>*<span class="w"> </span><span class="m">10</span><span class="w"> </span>**<span class="w"> </span><span class="m">7</span><span class="o">)</span>
<span class="w"> </span><span class="m">6</span><span class="w"> </span><span class="m">15</span>.00<span class="w"> </span>MB<span class="w"> </span>-153.00<span class="w"> </span>MB<span class="w"> </span>del<span class="w"> </span>b
<span class="w"> </span><span class="m">7</span><span class="w"> </span><span class="m">15</span>.00<span class="w"> </span>MB<span class="w"> </span><span class="m">0</span>.00<span class="w"> </span>MB<span class="w"> </span><span class="k">return</span><span class="w"> </span>a
</pre></div>
<p>voilà! Each line is prefixed by the memory usage in
MB of the Python interpreter after that line has been executed.</p>
</p></div>
Low rank approximation2011-11-06T12:05:00+01:002011-11-06T12:05:00+01:00Fabian Pedregosatag:fa.bianp.net,2011-11-06:/blog/2011/low-rank-approximation/<p>A little experiment to see what low rank approximation looks like. These
are the best rank-k approximations (in the Frobenius norm) to a
natural image for increasing values of k and an original image of rank
512.</p>
<img alt="" src="/blog/static/uploads/2011/11/animation1.gif" />
<p>Python code can be found <a class="reference external" href="https://gist.github.com/1342033">here</a>. GIF animation made
using ImageMagick's convert …</p><p>A little experiment to see what low rank approximation looks like. These
are the best rank-k approximations (in the Frobenius norm) to a
natural image for increasing values of k and an original image of rank
512.</p>
<img alt="" src="/blog/static/uploads/2011/11/animation1.gif" />
<p>Python code can be found <a class="reference external" href="https://gist.github.com/1342033">here</a>. GIF animation made
using ImageMagick's convert script.</p>
qr_multiply function in scipy.linalg2011-10-14T16:44:00+02:002011-10-14T16:44:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-10-14:/blog/2011/qr_multiply-function-in-scipylinalg/<p>In scipy's development version there's a new function closely related to
the <a class="reference external" href="http://en.wikipedia.org/wiki/QR_decomposition">QR-decomposition</a> of a matrix and to the least-squares solution of
a linear system. What this function does is to compute the
QR-decomposition of a matrix and then multiply the resulting orthogonal
factor by another arbitrary matrix. In pseudocode …</p><p>In scipy's development version there's a new function closely related to
the <a class="reference external" href="http://en.wikipedia.org/wiki/QR_decomposition">QR-decomposition</a> of a matrix and to the least-squares solution of
a linear system. What this function does is to compute the
QR-decomposition of a matrix and then multiply the resulting orthogonal
factor by another arbitrary matrix. In pseudocode:</p>
<pre class="literal-block">
def qr_multiply(X, Y):
Q, R = qr(X)
return dot(Q.T, Y)
</pre>
<p>but unlike this naive implementation, <tt class="docutils literal">qr_multiply</tt> is able to do
all this <strong>without</strong> explicitly computing the orthogonal Q matrix,
resulting in both memory and time savings. In the following picture I
measured the memory consumption as a function of time of running this
computation on a 1000 x 1000 matrix X and a vector Y (full code can
be found <a class="reference external" href="https://gist.github.com/1287168">here</a>):</p>
<img alt="" src="uploads/2011/10/qr_multiply1-300x225.png" />
<p>It can be seen that not only is <tt class="docutils literal">qr_multiply</tt>
almost twice as fast as the naive approach, but also that the memory
consumption is significantly reduced, since the orthogonal factor is
never explicitly computed. Credit for implementing the qr_multiply
function goes to <a class="reference external" href="https://github.com/tecki">Martin Teichmann</a>.</p>
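<p>For reference, the naive version from the pseudocode above is easy to write with NumPy, which makes clear exactly what <tt class="docutils literal">qr_multiply</tt> saves: the explicit formation of Q (the LAPACK routines underneath can instead apply the Householder reflectors from the factorization directly). A sketch:</p>

```python
import numpy as np

def naive_qr_multiply(X, Y):
    """Form Q explicitly, then multiply: the allocation qr_multiply avoids."""
    Q, R = np.linalg.qr(X)        # thin QR: Q is (n, p), R is (p, p)
    return Q.T.dot(Y)

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
Y = rng.randn(200)
QtY = naive_qr_multiply(X, Y)     # the quantity needed for least squares
```

<p>Solving the triangular system R x = QᵀY then yields the least-squares solution of ||X x - Y||, which is the typical use of this product.</p>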
scikit-learn 0.92011-10-02T11:19:00+02:002011-10-02T11:19:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-10-02:/blog/2011/scikit-learn-09/<p>Last week we released a new version of scikit-learn. The <a class="reference external" href="http://scikit-learn.sourceforge.net/stable/whats_new.html">Changelog is
particularly impressive</a>, yet personally this release is important for
other reasons. This will probably be my last release as a paid engineer.
I'm starting a PhD next month, and although I plan to continue
contributing to the project …</p><p>Last week we released a new version of scikit-learn. The <a class="reference external" href="http://scikit-learn.sourceforge.net/stable/whats_new.html">Changelog is
particularly impressive</a>, yet personally this release is important for
other reasons. This will probably be my last release as a paid engineer.
I'm starting a PhD next month, and although I plan to continue
contributing to the project and make a few more releases, I will
certainly have less time to devote to it. Luckily, I received a lot of
help from the community while preparing the release, from writing the
Changelog to building the Windows binaries, so I expect the transition to go
smoothly. Almost two years have elapsed since the first 0.1 release.
During this time, we did a lot of refactoring and broke the API several
times. However, I've seen some concerns about API stability both at the
EuroScipy conference and on the mailing list, and I've realized we need
to provide an API that does not break with every release, and to do so in a
way that keeps the project fun for developers. That's why I'm
extremely glad to see that although this release is big in changes,
these have been made in a more organized manner. Yes, we've broken the
API once again, but now there's a compatibility layer that ensures that
code written for 0.8 will continue working with the new release.</p>
Reworked example gallery for scikit-learn2011-09-04T20:09:00+02:002011-09-04T20:09:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-09-04:/blog/2011/reworked-example-gallery-for-scikit-learn/<p>I've been working lately on improving the scikit-learn example gallery
to show also a small thumbnail of the plotted result. Here is what the
gallery looks like now:</p>
<img alt="" src="http://fa.bianp.net/blog/static/uploads/2011/09/screenshot.png" />
<p>And the real thing should be already displayed in the <a class="reference external" href="http://scikit-learn.sourceforge.net/dev/auto_examples/index.html">development-documentation</a>. The next thing is to add a static image to those …</p><p>I've been working lately on improving the scikit-learn example gallery
to show also a small thumbnail of the plotted result. Here is what the
gallery looks like now:</p>
<img alt="" src="http://fa.bianp.net/blog/static/uploads/2011/09/screenshot.png" />
<p>And the real thing should already be visible in the <a class="reference external" href="http://scikit-learn.sourceforge.net/dev/auto_examples/index.html">development documentation</a>. The next step is to add a static image for examples that
don't generate any result; for instance, the <a class="reference external" href="http://scikit-learn.sourceforge.net/dev/auto_examples/applications/svm_gui.html">SVM-GUI</a> should have
an image to display.</p>
scikit-learn’s EuroScipy 2011 coding sprint -- day two2011-08-25T00:33:00+02:002011-08-25T00:33:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-08-25:/blog/2011/scikit-learns-euroscipy-2011-coding-sprint-day-two/<p><img alt="image0" src="http://fseoane.net/blog/static/uploads/2011/08/all-300x225.jpg" /></p>
<p>Today's coding sprint was a bit more crowded, with some
notable scipy hackers such as Ralph Gommers, <a class="reference external" href="http://mentat.za.net/">Stefan van der Walt</a>,
<a class="reference external" href="http://cournape.wordpress.com/">David Cournapeau</a> or <a class="reference external" href="http://blog.fperez.org/">Fernando Perez</a> from Ipython joining in. On
what got done: - We merged <a class="reference external" href="http://www.astro.washington.edu/users/vanderplas/">Jake</a>'s new BallTree code. This is a pure
Cython implementation of a nearest-neighbor …</p><p><img alt="image0" src="http://fseoane.net/blog/static/uploads/2011/08/all-300x225.jpg" /></p>
<p>Today's coding sprint was a bit more crowded, with some
notable scipy hackers such as Ralph Gommers, <a class="reference external" href="http://mentat.za.net/">Stefan van der Walt</a>,
<a class="reference external" href="http://cournape.wordpress.com/">David Cournapeau</a> and <a class="reference external" href="http://blog.fperez.org/">Fernando Perez</a> from IPython joining in. On
what got done:</p>
<ul>
<li>We merged <a class="reference external" href="http://www.astro.washington.edu/users/vanderplas/">Jake</a>'s new BallTree code. This is a pure Cython implementation of a nearest-neighbor search similar to the KDTree class in scipy.spatial, but much faster. The code looks awesome and it's a big speedup compared to the older code.</li>
<li>Vlad is ready to merge his <a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/221">dictionary learning code</a>, something that should happen in the upcoming days.</li>
<li>Initial support for Python 3: scikit-learn should now at least build and import cleanly under Python 3.</li>
<li>Some bugfixes in the Pipeline object and in docstrings.</li>
</ul>
<p>So this was the end of the scikit-learn sprint, but EuroScipy has just begun. See you tomorrow at
the conference (follow the signs)!</p>
<p><img alt="image1" src="http://fseoane.net/blog/static/uploads/2011/08/IMG_0093-202x300.jpg" /> <img alt="image2" src="http://fseoane.net/blog/static/uploads/2011/08/IMG_0092-189x300.jpg" /></p>
scikit-learn EuroScipy 2011 coding sprint -- day one2011-08-23T21:38:00+02:002011-08-23T21:38:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-08-23:/blog/2011/scikit-learn-euroscipy-2011-coding-sprint-day-one/<p>As a warm-up for the upcoming <a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2011">EuroScipy-conference</a>, some of the
<a class="reference external" href="http://scikit-learn.sf.net">scikit-learn</a> developers decided to gather and work together for a
couple of days. Today was the first day and there was only a handful of
us, as the real kickoff is expected tomorrow. Some interesting coding
happened, although most of …</p><p>As a warm-up for the upcoming <a class="reference external" href="http://www.euroscipy.org/conference/euroscipy2011">EuroScipy-conference</a>, some of the
<a class="reference external" href="http://scikit-learn.sf.net">scikit-learn</a> developers decided to gather and work together for a
couple of days. Today was the first day and there was only a handful of
us, as the real kickoff is expected tomorrow. Some interesting coding
happened, although most of us were still preparing material for the
EuroScipy tutorials:</p>
<ul>
<li>API changes: removal of keyword parameters to the <em>fit</em> method, and a new <em>set_params</em> method (<a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/306">pull request 1</a>).</li>
<li>Some bugfixing in NuSVR (<a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/315">pull request 2</a>).</li>
<li>Review of <a class="reference external" href="http://vene.ro">Vlad</a>'s code, developed during his Summer of Code program.</li>
<li>A lot of discussion about algorithms, code, APIs, and the buildbot dance!</li>
</ul>
<img alt="" src="http://fa.bianp.net/blog/uploads/2011/08/IMG_0076-150x150.jpg" />
<img alt="" src="uploads/2011/08/Picture-3-150x150.png" />
<img alt="" src="uploads/2011/08/IMG_0074-150x150.jpg" />
<img alt="" src="uploads/2011/08/Picture-5-150x150.png" />
<img alt="" src="uploads/2011/08/emanuelle-150x150.jpg" />
Ridge regression path2011-07-12T09:21:00+02:002011-07-12T09:21:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-07-12:/blog/2011/ridge-regression-path/<p>Ridge coefficients for multiple values of the regularization parameter
can be elegantly computed by updating the <em>thin</em> SVD decomposition of
the design matrix:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">linalg</span>
<span class="k">def</span> <span class="nf">ridge</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">alphas</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""</span>
<span class="sd"> Return coefficients for regularized least squares</span>
<span class="sd"> min ||A x - b||^2 + alpha ||x||^2 …</span></pre></div><p>Ridge coefficients for multiple values of the regularization parameter
can be elegantly computed by updating the <em>thin</em> SVD decomposition of
the design matrix:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">linalg</span>
<span class="k">def</span> <span class="nf">ridge</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">alphas</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""</span>
<span class="sd"> Return coefficients for regularized least squares</span>
<span class="sd"> min ||A x - b||^2 + alpha ||x||^2</span>
<span class="sd"> Parameters</span>
<span class="sd"> ----------</span>
<span class="sd"> A : array, shape (n, p)</span>
<span class="sd"> b : array, shape (n,)</span>
<span class="sd"> alphas : array, shape (k,)</span>
<span class="sd"> Returns</span>
<span class="sd"> -------</span>
<span class="sd"> coef: array, shape (p, k)</span>
<span class="sd"> """</span>
<span class="n">U</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">Vt</span> <span class="o">=</span> <span class="n">linalg</span><span class="o">.</span><span class="n">svd</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">full_matrices</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">s</span> <span class="o">/</span> <span class="p">(</span><span class="n">s</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">]</span><span class="o">.</span><span class="n">T</span> <span class="o">**</span> <span class="mi">2</span> <span class="o">+</span> <span class="n">alphas</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">])</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">d</span> <span class="o">*</span> <span class="n">U</span><span class="o">.</span><span class="n">T</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">b</span><span class="p">),</span> <span class="n">Vt</span><span class="p">)</span><span class="o">.</span><span class="n">T</span>
</pre></div>
<p>This can be used to efficiently compute what is called the <em>regularization
path</em>, that is, to plot the coefficients as a function of the
regularization parameter. Since the bottleneck of the algorithm is the
singular value decomposition, computing the coefficients for other
values of the regularization parameter basically comes for free.</p>
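<p>As a quick sanity check (self-contained, assuming NumPy/SciPy), the SVD-based path can be compared against a direct normal-equations solve for each value of the regularization parameter:</p>

```python
import numpy as np
from scipy import linalg

def ridge(A, b, alphas):
    """Ridge coefficients for several alphas via one thin SVD of A."""
    U, s, Vt = linalg.svd(A, full_matrices=False)
    d = s / (s[:, np.newaxis].T ** 2 + alphas[:, np.newaxis])
    return np.dot(d * U.T.dot(b), Vt).T

rng = np.random.RandomState(0)
A = rng.randn(30, 5)
b = rng.randn(30)
alphas = np.array([0.1, 1.0, 10.0])
coefs = ridge(A, b, alphas)                      # shape (5, 3)
# one linear solve of (A^T A + alpha I) x = A^T b per alpha, for comparison
direct = np.column_stack([
    np.linalg.solve(A.T.dot(A) + a * np.eye(5), A.T.dot(b)) for a in alphas])
```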
<img alt="" src="http://fa.bianp.net/blog/static/uploads/2011/07/ridge_nocv.png" />
<p>A variant of this algorithm can then be used to compute the
optimal regularization parameter in the sense of leave-one-out
cross-validation and is implemented in scikit-learn's <a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/linear_model.html#generalized-cross-validation">RidgeCV</a> (for
which Mathieu Blondel has an <a class="reference external" href="http://www.mblondel.org/journal/2011/02/09/regularized-least-squares/">excellent post</a>). This optimal
parameter is denoted with a vertical dotted line in the following
picture, full code can be found <a class="reference external" href="https://gist.github.com/1076844">here</a>.</p>
<img alt="" src="http://fa.bianp.net/blog/static/uploads/2011/07/ridge.png" />
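<p>Why leave-one-out comes so cheaply here deserves a line of code: ridge is a linear smoother, so its exact LOO residuals follow from a single fit via the diagonal of the hat matrix H = A (A^T A + alpha I)^-1 A^T, as e_i = (b_i - yhat_i) / (1 - H_ii). Below is a small dense-solve sketch of that identity (an illustration, not scikit-learn's actual implementation, which works through the SVD):</p>

```python
import numpy as np

def loo_ridge_mse(A, b, alpha):
    """Exact leave-one-out mean squared error for ridge from a single fit.

    Uses the identity e_i = (b_i - yhat_i) / (1 - H_ii), where
    H = A (A^T A + alpha I)^-1 A^T is the "hat" matrix; the shortcut
    is exact for ridge regression, so no n refits are needed.
    """
    n, p = A.shape
    H = A.dot(np.linalg.solve(A.T.dot(A) + alpha * np.eye(p), A.T))
    residuals = (b - H.dot(b)) / (1.0 - np.diag(H))
    return np.mean(residuals ** 2)

rng = np.random.RandomState(0)
A = rng.randn(15, 3)
b = rng.randn(15)
score = loo_ridge_mse(A, b, alpha=0.7)
```

<p>Scanning this score over a grid of alphas and taking the minimizer is exactly what the vertical dotted line in the figure above marks.</p>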
LLE comes in different flavours2011-06-30T16:22:00+02:002011-06-30T16:22:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-06-30:/blog/2011/lle-comes-in-different-flavours/<p>I haven't worked in the manifold module since <a class="reference external" href="http://fa.bianp.net/blog/2011/manifold-learning-in-scikit-learn/">last time</a>, yet thanks
to <a class="reference external" href="http://www.astro.washington.edu/users/vanderplas/">Jake VanderPlas</a> there are some cool features I can talk about.
First off, the ARPACK backend is finally working and gives a solid
speedup over the <a class="reference external" href="http://fa.bianp.net/blog/2011/locally-linear-embedding-and-sparse-eigensolvers/">lobpcg + PyAMG approach</a>. The key is to use ARPACK's
shift-invert mode …</p><p>I haven't worked in the manifold module since <a class="reference external" href="http://fa.bianp.net/blog/2011/manifold-learning-in-scikit-learn/">last time</a>, yet thanks
to <a class="reference external" href="http://www.astro.washington.edu/users/vanderplas/">Jake VanderPlas</a> there are some cool features I can talk about.
First off, the ARPACK backend is finally working and gives a solid
speedup over the <a class="reference external" href="http://fa.bianp.net/blog/2011/locally-linear-embedding-and-sparse-eigensolvers/">lobpcg + PyAMG approach</a>. The key is to use ARPACK's
shift-invert mode instead of the regular mode, a subtle change that
drove me crazy for weeks and that Jake spotted by comparing it to his
<a class="reference external" href="https://github.com/jakevdp/pyLLE">C++ LLE implementation</a>. More importantly, some variants of Locally
Linear Embedding (LLE) have been added to the module: <a class="reference external" href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.70.382">Modified LLE</a>,
<a class="reference external" href="http://www-stat.stanford.edu/~donoho/Reports/2003/HessianEigenmaps.pdf">Hessian LLE</a> and <a class="reference external" href="http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.3693">LTSA</a>. These seem to generate better solutions than
the classical LLE with timings that are not far apart. All the LLE
variants currently implemented can be seen in <a class="reference external" href="http://scikit-learn.sourceforge.net/dev/auto_examples/manifold/plot_compare_methods.html">this example</a>, where
they are applied to an S-shaped dataset.</p>
<img alt="" src="http://fa.bianp.net/blog/wp-content/uploads/2011/06/manifold_methods.png" />
Manifold learning in scikit-learn2011-06-07T09:19:00+02:002011-06-07T09:19:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-06-07:/blog/2011/manifold-learning-in-scikit-learn/<p>The manifold module in <a class="reference external" href="http://scikit-learn.sf.net">scikit-learn</a> is slowly progressing: the
<a class="reference external" href="http://fa.bianp.net/blog/2011/locally-linear-embedding-and-sparse-eigensolvers/">locally linear embedding</a> implementation was finally merged along with
<a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/manifold.html">some documentation</a>. At about the same time but in a different
timezone, <a class="reference external" href="http://www.astro.washington.edu/users/vanderplas/">Jake VanderPlas</a> began coding <a class="reference external" href="https://github.com/jakevdp/scikit-learn/compare/master...manifold">other manifold learning
methods</a> and back in Paris <a class="reference external" href="http://twitter.com/ogrisel">Olivier Grisel</a> made <a class="reference external" href="http://fa.bianp.net/blog/2011/handwritten-digits-and-locally-linear-embedding/">my digits example</a>
a <a class="reference external" href="http://scikit-learn.sourceforge.net/dev/auto_examples/manifold/plot_lle_digits.html">lot …</a></p><p>The manifold module in <a class="reference external" href="http://scikit-learn.sf.net">scikit-learn</a> is slowly progressing: the
<a class="reference external" href="http://fa.bianp.net/blog/2011/locally-linear-embedding-and-sparse-eigensolvers/">locally linear embedding</a> implementation was finally merged along with
<a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/manifold.html">some documentation</a>. At about the same time but in a different
timezone, <a class="reference external" href="http://www.astro.washington.edu/users/vanderplas/">Jake VanderPlas</a> began coding <a class="reference external" href="https://github.com/jakevdp/scikit-learn/compare/master...manifold">other manifold learning
methods</a> and back in Paris <a class="reference external" href="http://twitter.com/ogrisel">Olivier Grisel</a> made <a class="reference external" href="http://fa.bianp.net/blog/2011/handwritten-digits-and-locally-linear-embedding/">my digits example</a>
a <a class="reference external" href="http://scikit-learn.sourceforge.net/dev/auto_examples/manifold/plot_lle_digits.html">lot nicer</a> by adding the embedding of different dimensionality
reduction techniques from scikit-learn:</p>
<p><img alt="image0" src="http://fa.bianp.net/blog/wp-content/uploads/2011/06/plot_lle_digits_4-300x225.png" /> <img alt="image1" src="http://fa.bianp.net/blog/wp-content/uploads/2011/06/plot_lle_digits_3-300x225.png" />
<img alt="image2" src="http://fa.bianp.net/blog/wp-content/uploads/2011/06/plot_lle_digits_2-300x225.png" /> <img alt="image3" src="http://fa.bianp.net/blog/wp-content/uploads/2011/06/plot_lle_digits_1-300x225.png" /></p>
Handwritten digits and Locally Linear Embedding2011-05-04T10:46:00+02:002011-05-04T10:46:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-05-04:/blog/2011/handwritten-digits-and-locally-linear-embedding/<p>I decided to test my <a class="reference external" href="http://fa.bianp.net/blog/2011/locally-linear-embedding-and-sparse-eigensolvers/">new Locally Linear Embedding (LLE)</a>
implementation against a real dataset. At first I didn't think this
would turn out very well, since LLE seems to be somewhat fragile,
yielding largely different results for small differences in parameters
such as number of neighbors or tolerance, but …</p><p>I decided to test my <a class="reference external" href="http://fa.bianp.net/blog/2011/locally-linear-embedding-and-sparse-eigensolvers/">new Locally Linear Embedding (LLE)</a>
implementation against a real dataset. At first I didn't think this
would turn out very well, since LLE seems to be somewhat fragile,
yielding largely different results for small differences in parameters
such as number of neighbors or tolerance, but as it turns out, results
are not bad at all. The idea is to take a handwritten digit, stored as an
8x8 pixel image, and flatten it into an array of 8x8 = 64
floating-point values.</p>
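The flattening step can be sketched with numpy (the 8x8 image below is synthetic, just for illustration; the real dataset stores grayscale intensities in the same 8x8 grid):

```python
import numpy as np

# A synthetic 8x8 "digit" image, just for illustration.
rng = np.random.RandomState(0)
image = rng.randint(0, 17, size=(8, 8)).astype(np.float64)

# Flattening turns the image into a single point in 64-dimensional space.
vector = image.ravel()
print(vector.shape)  # (64,)
```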
<p><img alt="image0" src="http://fa.bianp.net/blog/static/uploads/2011/05/digits_transformation1.png" /></p>
<p>Then each handwritten digit can be
seen as a point in a 64-dimensional space. Of course, visualizing in
64-dimensional spaces is not easy, and that's where <a class="reference external" href="http://fa.bianp.net/blog/2011/locally-linear-embedding-and-sparse-eigensolvers/">Locally Linear
Embedding</a> comes handy. We'll use this method to reduce the dimension
from 64 to 2 with the hope of preserving most of the underlying manifold
structure. The following is a plot of the handwritten digits {0, 1, 2,
3, 4} after performing locally linear embedding. As you can see, some
groups are nicely clustered: notably, the 0 is isolated, while others like
{4, 5} are closer, precisely those that look most similar.</p>
<p><img alt="image1" src="http://fa.bianp.net/blog/static/uploads/2011/05/Picture-1.png" /></p>
<p>Source code for this example <a class="reference external" href="https://gist.github.com/954815">can be found here</a> but relies on my
manifold branch of scikit-learn.</p>
Low-level routines for Support Vector Machines2011-04-27T15:27:00+02:002011-04-27T15:27:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-04-27:/blog/2011/low-level-routines-for-support-vector-machines/<p>I've been working lately on improving the low-level API of the libsvm
bindings in scikit-learn. The goal is to provide an API that encourages
an efficient use of these libraries for expert users. These are methods
that have lower overhead than the <a class="reference external" href="http://scikit-learn.sourceforge.net/modules/svm.html">object-oriented interface</a> as they
are closer to the …</p><p>I've been working lately on improving the low-level API of the libsvm
bindings in scikit-learn. The goal is to provide an API that encourages
an efficient use of these libraries for expert users. These are methods
that have lower overhead than the <a class="reference external" href="http://scikit-learn.sourceforge.net/modules/svm.html">object-oriented interface</a> as they
are closer to the C implementation, but do not have an interface as
polished. Here, all parameters are expected to be of the correct type,
and submitting one of the wrong type will make the function exit
immediately with a ValueError. For instance, input data is expected to
be of type float64, even for class labels! Another peculiarity of these
methods is that they only take and return numpy arrays. No custom
objects: all methods take and return arrays. That looks something like:</p>
<div class="highlight"><pre><span></span>import numpy as np
from scikits.learn import svm, datasets

iris = datasets.load_iris()
iris.target = iris.target.astype(np.float64)

learned_params = svm.libsvm.fit(iris.data, iris.target)
pred = svm.libsvm.predict(iris.data, *learned_params)
</pre></div>
<p>Here, I used the fact that the parameters returned by <a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/generated/scikits.learn.svm.libsvm.fit.html">libsvm.fit</a> can just be passed to
<a class="reference external" href="http://scikit-learn.sourceforge.net/dev/modules/generated/scikits.learn.svm.libsvm.predict.html">libsvm.predict</a>. However, any other parameters should be
manually passed to both methods.</p>
new get_blas_funcs in scipy.linalg2011-04-23T18:24:00+02:002011-04-23T18:24:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-04-23:/blog/2011/new-get_blas_funcs-in-scipylinalg/<p>Today some changes I made to the function
scipy.linalg.get_blas_funcs() got merged. The main enhancement is that
get_blas_funcs() now also accepts a single string and a dtype as input
parameters, so that fetching the BLAS function for a specific type
becomes more natural. For example, fetching the gemm routine for …</p><p>Today some changes I made to the function
scipy.linalg.get_blas_funcs() got merged. The main enhancement is that
get_blas_funcs() now also accepts a single string and a dtype as input
parameters, so that fetching the BLAS function for a specific type
becomes more natural. For example, fetching the gemm routine for a
single-precision complex number now looks like this:</p>
<div class="highlight"><pre><span></span>gemm = scipy.linalg.get_blas_funcs('gemm', dtype=np.complex64)
</pre></div>
<p>compared to the clumsy old syntax:</p>
<div class="highlight"><pre><span></span>X = np.empty(0, dtype=np.complex64)
gemm, = scipy.linalg.get_blas_funcs(('gemm',), (X,))
</pre></div>
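As a quick illustration of the fetched routine in action (a sketch with made-up matrices; it assumes the scipy BLAS wrapper's `gemm(alpha, a, b)` calling convention), the returned function computes `alpha * a @ b`:

```python
import numpy as np
from scipy.linalg import get_blas_funcs

a = np.array([[1 + 1j, 0], [0, 1]], dtype=np.complex64)
b = np.array([[1, 2], [3, 4]], dtype=np.complex64)

# Fetch the complex single-precision gemm (cgemm) and use it to
# compute alpha * a @ b.
gemm = get_blas_funcs('gemm', dtype=np.complex64)
c = gemm(1.0, a, b)
```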
Locally linear embedding and sparse eigensolvers2011-04-21T14:28:00+02:002011-04-21T14:28:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-04-21:/blog/2011/locally-linear-embedding-and-sparse-eigensolvers/<p>I've been working for some time on implementing a <a class="reference external" href="http://www.cs.nyu.edu/~roweis/lle/algorithm.html">locally linear
embedding</a> algorithm for the upcoming manifold module in scikit-learn.
While several implementations of this algorithm exist in Python, as far
as I know none of them is able to use a sparse eigensolver in the last
step of the …</p><p>I've been working for some time on implementing a <a class="reference external" href="http://www.cs.nyu.edu/~roweis/lle/algorithm.html">locally linear
embedding</a> algorithm for the upcoming manifold module in scikit-learn.
While several implementations of this algorithm exist in Python, as far
as I know none of them is able to use a sparse eigensolver in the last
step of the algorithm, falling back to dense routines, which causes a huge
overhead in this step. To overcome this, my first implementation used
<tt class="docutils literal">scipy.sparse.linalg.eigsh</tt>, which is a sparse eigensolver shipped by
scipy and based on ARPACK. However, this approach converged extremely
slowly, with timings that largely exceeded those of dense solvers.
Recently I found an approach that seems to work reasonably well, with
timings that beat existing routines by a factor of 5 on the swiss roll.
This code solves the problem by making use of a preconditioner computed by
<a class="reference external" href="http://code.google.com/p/pyamg/">PyAMG</a>.</p>
<div class="highlight"><pre><span></span>import numpy as np
from scipy.sparse import linalg, eye
from pyamg import smoothed_aggregation_solver
from scikits.learn import neighbors

def locally_linear_embedding(X, n_neighbors, out_dim, tol=1e-6, max_iter=200):
    W = neighbors.kneighbors_graph(
        X, n_neighbors=n_neighbors, mode='barycenter')
    # M = (I - W)' (I - W)
    A = eye(*W.shape, format=W.format) - W
    A = (A.T).dot(A).tocsr()
    # initial approximation to the eigenvectors
    X = np.random.rand(W.shape[0], out_dim)
    ml = smoothed_aggregation_solver(A, symmetry='symmetric')
    prec = ml.aspreconditioner()
    # compute eigenvalues and eigenvectors with LOBPCG
    eigen_values, eigen_vectors = linalg.lobpcg(
        A, X, M=prec, largest=False, tol=tol, maxiter=max_iter)
    index = np.argsort(eigen_values)
    return eigen_vectors[:, index], np.sum(eigen_values)
</pre></div>
<p>Full code for this algorithm applied to the
swiss roll can be found <a class="reference external" href="https://gist.github.com/934363">here</a>, and I hope it will soon be part of
<a class="reference external" href="http://scikit-learn.sourceforge.net/">scikit-learn</a>.</p>
<p><img alt="image0" src="http://fa.bianp.net/blog/static/uploads/2011/04/lle1-690x1024.png" /></p>
scikits.learn is now part of pythonxy2011-04-20T13:48:00+02:002011-04-20T13:48:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-04-20:/blog/2011/scikitslearn-is-now-part-of-pythonxy/<p>The guys behind <a class="reference external" href="http://www.pythonxy.com/">pythonxy</a> have been kind enough to add the latest
scikit-learn as an <a class="reference external" href="http://code.google.com/p/pythonxy/wiki/AdditionalPlugins">additional plugin</a> for their distribution. Having
scikit-learn in both <a class="reference external" href="http://www.pythonxy.com/">pythonxy</a> and <a class="reference external" href="http://www.enthought.com/products/epd.php">EPD</a> will hopefully make it
easier to use for Windows users. <img alt="pythonxy-logo" src="http://fa.bianp.net/blog/static/uploads/2011/04/pythonxy-logo.png" /> For now I will continue
to make windows precompiled binaries, but pythonxy …</p><p>The guys behind <a class="reference external" href="http://www.pythonxy.com/">pythonxy</a> have been kind enough to add the latest
scikit-learn as an <a class="reference external" href="http://code.google.com/p/pythonxy/wiki/AdditionalPlugins">additional plugin</a> for their distribution. Having
scikit-learn in both <a class="reference external" href="http://www.pythonxy.com/">pythonxy</a> and <a class="reference external" href="http://www.enthought.com/products/epd.php">EPD</a> will hopefully make it
easier to use for Windows users. <img alt="pythonxy-logo" src="http://fa.bianp.net/blog/static/uploads/2011/04/pythonxy-logo.png" /> For now I will continue
to make Windows precompiled binaries, but pythonxy users finally have a
package that is guaranteed to work with their installation.</p>
Least squares with equality constraint2011-04-14T10:02:00+02:002011-04-14T10:02:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-04-14:/blog/2011/least-squares-with-equality-constrain/<p>The following algorithm computes the least squares solution ||Ax -
b|| subject to the equality constraint Bx = d. It's a classic algorithm
that can be implemented using only a QR decomposition and a least
squares solver. This implementation uses numpy and scipy. It makes use
of the new linalg.solve_triangular function …</p><p>The following algorithm computes the least squares solution ||Ax -
b|| subject to the equality constraint Bx = d. It's a classic algorithm
that can be implemented using only a QR decomposition and a least
squares solver. This implementation uses numpy and scipy. It makes use
of the new linalg.solve_triangular function in scipy 0.9, although it
degrades to linalg.solve on older versions.</p>
<div class="highlight"><pre><span></span>import numpy as np

def lse(A, b, B, d, cond=None):
    """
    Equality-constrained least squares. The following algorithm minimizes
    ||Ax - b|| subject to the constraint Bx = d.

    Parameters
    ----------
    A : array-like, shape=[m, n]
    b : array-like, shape=[m]
    B : array-like, shape=[p, n]
    d : array-like, shape=[p]
    cond : float, optional
        Cutoff for 'small' singular values; used to determine the
        effective rank of A. Singular values smaller than
        ``cond * largest_singular_value`` are considered zero.

    Reference
    ---------
    Matrix Computations, Golub & van Loan, algorithm 12.1.2

    Examples
    --------
    >>> A, b = [[0, 2, 3], [1, 3, 4.5]], [1, 1]
    >>> B, d = [[1, 1, 0]], [1]
    >>> lse(A, b, B, d)
    array([-0.5       ,  1.5       , -0.66666667])
    """
    from scipy import linalg
    if not hasattr(linalg, 'solve_triangular'):
        # compatibility for old scipy
        def solve_triangular(X, y, **kwargs):
            return linalg.solve(X, y)
    else:
        solve_triangular = linalg.solve_triangular
    A, b, B, d = map(np.asanyarray, (A, b, B, d))
    p = B.shape[0]
    Q, R = linalg.qr(B.T)
    y = solve_triangular(R[:p, :p], d, trans='T', lower=False)
    A = np.dot(A, Q)
    z = linalg.lstsq(A[:, p:], b - np.dot(A[:, :p], y), cond=cond)[0].ravel()
    return np.dot(Q[:, :p], y) + np.dot(Q[:, p:], z)
</pre></div>
<p><strong>Update: scipy now has a function qr_multiply, which would
considerably speed up this code</strong></p>
A profiler for Python extensions2011-04-06T14:02:00+02:002011-04-06T14:02:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-04-06:/blog/2011/a-profiler-for-python-extensions/<p>Profiling Python extensions has not been a pleasant experience for me,
so I made my own package to do the job. Existing alternatives were
either hard to use, forcing you to recompile with custom flags like
gprof, or desperately slow like valgrind/callgrind. The package I'll
talk about is called …</p><p>Profiling Python extensions has not been a pleasant experience for me,
so I made my own package to do the job. Existing alternatives were
either hard to use, forcing you to recompile with custom flags like
gprof, or desperately slow like valgrind/callgrind. The package I'll
talk about is called <a class="reference external" href="http://pypi.python.org/pypi/yep">YEP</a> and is designed to be:</p>
<ol class="arabic simple">
<li>Unobtrusive: no recompiling, no custom linking. Just launch & profile.</li>
<li>Fast: waiting sucks.</li>
<li>Easy to use.</li>
</ol>
<div class="section" id="basic-usage">
<h2>Basic usage</h2>
<p>YEP is distributed as a python module and can be <a class="reference external" href="http://pypi.python.org/pypi/yep">downloaded from the
pypi</a>. After installation, it is executed by giving the <strong>-m yep</strong>
flags to the interpreter. Without any arguments, it will just print a
help message:</p>
<div class="highlight"><pre><span></span>$ python -m yep
Usage: python -m yep [options] scriptfile [arg] ...
...
</pre></div>
<p>Say you want to profile a script called
<cite>my_script.py</cite>, then the way to quickly get a profiler report is to
execute:</p>
<div class="highlight"><pre><span></span>$<span class="w"> </span>python<span class="w"> </span>-m<span class="w"> </span>yep<span class="w"> </span>-v<span class="w"> </span>my_script.py
</pre></div>
<p>For example,
running YEP on <a class="reference external" href="http://scikit-learn.sourceforge.net/auto_examples/grid_search_digits.html">this example</a> that makes use of <a class="reference external" href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/">libsvm</a>, a C++
library for Support Vector Machines, outputs</p>
<table border="1" class="docutils">
<colgroup>
<col width="100%" />
</colgroup>
<tbody valign="top">
<tr><td><img alt="image1" src="https://lh5.googleusercontent.com/_IOBIGAGXP4o/TZruzeuFJjI/AAAAAAAAAGI/JSmxqbOd0o4/s400/Screenshot-fabian%40localhost%3A%20-home-fabian.png" /></td>
</tr>
<tr><td>From <a class="reference external" href="https://picasaweb.google.com/fabian.pedregosa.izquierdo/Screenshots?feat=embedwebsite">Screenshots</a></td>
</tr>
</tbody>
</table>
<p>The last column prints the names of the functions, so just looking at
those that start with svm:: gives you an overview of how our libsvm is
spending its time.</p>
</div>
<div class="section" id="other-usages">
<h2>Other usages</h2>
<p>Calling YEP without the -v flag will create a my_script.py.prof file that
can be analyzed with pprof (google-pprof on some systems). pprof has a
huge range of options, letting you filter on some functions, output to
ghostview or print a line-by-line profile, to mention a few. For
example, you can generate a call graph with the command:</p>
<div class="highlight"><pre><span></span>$ pprof --gv /usr/bin/python my_script.py.prof
</pre></div>
</div>
<div class="section" id="more-control">
<h2>More control</h2>
<p>If you would like to manually start/stop the profiler rather than
profile the whole script, you can use the functions yep.start() and
yep.stop() inside a python script. This will write the profile to a
given filename, so make sure the directory is writable:</p>
<div class="highlight"><pre><span></span>import yep

yep.start('out.prof')  # will create an out.prof file
# do something ...
yep.stop()
</pre></div>
</div>
<div class="section" id="future-work">
<h2>Future work</h2>
<p>The -v option shown at the beginning is just a dirty hack that launches
pprof and pipes the output into less. A more robust approach would be to
read the resulting profile from python and manipulate it from there,
writing either to stdout or to <a class="reference external" href="http://docs.python.org/library/profile.html#pstats.Stats">pstats</a> format. This shouldn't be too difficult, as
the pprof format is described <a class="reference external" href="http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile-fileformat.html">here</a>.</p>
</div>
<div class="section" id="acknowledgment">
<h2>Acknowledgment</h2>
<p>The original idea to use google-perftools to profile Python extensions
was given on this <a class="reference external" href="http://stackoverflow.com/questions/2615153/profiling-python-c-extensions">Stack overflow question</a></p>
</div>
scikit-learn coding sprint in Paris2011-04-02T12:07:00+02:002011-04-02T12:07:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-04-02:/blog/2011/scikit-learn-coding-sprint-in-paris/<p>Yesterday was the scikit-learn coding sprint in Paris. It was great to
meet with old developers (Vincent Michel) and new ones: some of whom I
was already familiar with from the mailing list while others came just
to say hi and get familiar with the code. It was really great …</p><p>Yesterday was the scikit-learn coding sprint in Paris. It was great to
meet with old developers (Vincent Michel) and new ones: some of whom I
was already familiar with from the mailing list while others came just
to say hi and get familiar with the code. It was really great to have
people from such different backgrounds discuss concrete problems and
get things done. A lot of work was done, most of it not yet merged,
but if I had to highlight the three most important items for me, they would be
the <a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/86">merge of the hcluster2 branch</a>, the awesome work of <a class="reference external" href="https://github.com/thouis">thouis</a>
in replacing the <a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/120">C++ interface to the ball_tree with a Cython one</a>
and support for Python3 (not bug-free, but imports OK). As for me, I've
been working mostly on providing efficient cross-validation for
Support Vector Machines. The status of this is: low-level API seems to
work fine (scikits.learn.svm.libsvm.cross_validation) but high-level
API <a class="reference external" href="https://github.com/scikit-learn/scikit-learn/pull/117">still needs some work</a>. This is a picture featuring most of
the people who were at the sprint, taken around 16h at <a class="reference external" href="http://www.logilab.fr/">Logilab's</a>
headquarters.</p>
<img alt="" src="http://farm6.static.flickr.com/5092/5578952957_27b653d0a4.jpg" />
py3k in scikit-learn2011-03-28T15:23:00+02:002011-03-28T15:23:00+02:00Fabian Pedregosatag:fa.bianp.net,2011-03-28:/blog/2011/py3k-in-scikit-learn/<p>One thing I'd really like to see done in <a class="reference external" href="http://gael-varoquaux.info/blog/?p=149">this Friday's scikit-learn
sprint</a> is to have full support for Python 3. There's <a class="reference external" href="http://github.com/fabianp/scikit-learn/compare/master...py3k">a branch where
the hard work has been done</a> (porting C extensions, automatic 2to3
conversion, etc.), although joblib still has some bugs and no one has
attempted to …</p><p>One thing I'd really like to see done in <a class="reference external" href="http://gael-varoquaux.info/blog/?p=149">this Friday's scikit-learn
sprint</a> is to have full support for Python 3. There's <a class="reference external" href="http://github.com/fabianp/scikit-learn/compare/master...py3k">a branch where
the hard work has been done</a> (porting C extensions, automatic 2to3
conversion, etc.), although joblib still has some bugs and no one has
attempted to do anything serious with this branch yet ...</p>
Computing the vector norm2011-02-15T10:31:00+01:002011-02-15T10:31:00+01:00Fabian Pedregosatag:fa.bianp.net,2011-02-15:/blog/2011/computing-the-vector-norm/<p><strong>Update: a fast and stable norm was added to scipy.linalg in August
2011 and will be available in scipy 0.10</strong> Last week I discussed with
<a class="reference external" href="http://gael-varoquaux.info/blog/">Gael</a> how we should compute the euclidean norm of a vector a using
SciPy. Two approaches suggest themselves, either calling
scipy.linalg.norm …</p><p><strong>Update: a fast and stable norm was added to scipy.linalg in August
2011 and will be available in scipy 0.10</strong> Last week I discussed with
<a class="reference external" href="http://gael-varoquaux.info/blog/">Gael</a> how we should compute the euclidean norm of a vector a using
SciPy. Two approaches suggest themselves, either calling
scipy.linalg.norm(a) or computing sqrt(a.T a), but as I learned later,
both have issues. <strong>Note:</strong> I use single-precision arithmetic for
simplicity, but similar results hold for double-precision.</p>
<div class="section" id="overflow-and-underflow">
<h2>Overflow and underflow</h2>
<p>Both approaches behave terribly in the presence of very large or very
small numbers. Take for example an array with a single entry:</p>
<div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">0</span><span class="p">]:</span> <span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1e20</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">1</span><span class="p">]:</span> <span class="n">a</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">1</span><span class="p">]:</span> <span class="n">array</span><span class="p">([</span><span class="mf">1.00000002e+20</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float32</span><span class="p">)</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="n">scipy</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="n">inf</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">3</span><span class="p">]:</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">a</span><span class="p">))</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">3</span><span class="p">]:</span> <span class="n">inf</span>
</pre></div>
<p>That is, both methods return infinity. However, the correct answer is
10^20, which comfortably fits in a <a class="reference external" href="http://en.wikipedia.org/wiki/Single_precision_floating-point_format">single-precision</a>
float. Similar examples can be found where numbers underflow.</p>
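<p>The standard remedy is to rescale before squaring. A minimal sketch of that idea in pure NumPy (safe_norm is a hypothetical helper written for illustration, not part of scipy):</p>

```python
import numpy as np

def safe_norm(a):
    # Rescale by the largest magnitude so that squaring cannot overflow,
    # then undo the scaling outside the square root.
    scale = np.abs(a).max()
    if scale == 0:
        return 0.0
    b = a / scale  # all entries now have magnitude <= 1
    return scale * np.sqrt(np.dot(b, b))

a = np.array([1e20], dtype=np.float32)
print(np.sqrt(np.dot(a, a)))  # overflows to inf in float32
print(safe_norm(a))           # finite, ~1e20
```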
</div>
<div class="section" id="stability">
<h2>Stability</h2>
<p>scipy.linalg.norm also behaves poorly with regard to numerical
stability: in the presence of entries of different magnitudes, severe
cancellation can occur. Take for example an array with a single 10,000
in the first position, followed by 10,000 ones:</p>
<div class="highlight"><pre><span></span><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1e4</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="mi">10000</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
</pre></div>
<p>In this case, scipy.linalg.norm will discard all the ones, producing</p>
<div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">3</span><span class="p">]:</span> <span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="o">-</span> <span class="mf">1e4</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">3</span><span class="p">]:</span> <span class="mf">0.0</span>
</pre></div>
<p>when the correct answer is 0.5. In this case $\sqrt{a^T a}$ behaves
much better, since the result of a single-precision dot product is
accumulated in double precision (though if double precision is used,
results are not accumulated in quadruple precision):</p>
<div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">4</span><span class="p">]:</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">a</span><span class="p">))</span> <span class="o">-</span> <span class="mf">1e4</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">4</span><span class="p">]:</span> <span class="mf">0.5</span>
</pre></div>
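<p>The 0.5 above is no accident: $\sqrt{10^8 + 10^4} = 10^4\sqrt{1 + 10^{-4}} \approx 10^4 + 0.5$. A quick check, which also shows why naive float32 accumulation drops the ones entirely:</p>

```python
import numpy as np

a = np.array([1e4] + [1] * 10000, dtype=np.float32)

# Accumulating in float64 recovers the true answer.
exact = np.sqrt(np.square(a.astype(np.float64)).sum())
print(exact - 1e4)  # ~0.4999875

# In float32, adding 1.0 to 1e8 is a no-op: the spacing between
# consecutive float32 values near 1e8 is 8, so the 1.0 is rounded away.
s = np.float32(1e8)
print(s + np.float32(1.0) == s)  # True
```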
</div>
<div class="section" id="blas-blas-blas">
<h2>BLAS BLAS BLAS ...</h2>
<p>The BLAS function <a class="reference external" href="http://www.netlib.org/blas/snrm2.f">nrm2</a> does automatic scaling of its input, rendering
it more stable and tolerant to overflow. Luckily, scipy provides a
mechanism to call some BLAS functions:</p>
<div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">5</span><span class="p">]:</span> <span class="n">nrm2</span><span class="p">,</span> <span class="o">=</span> <span class="n">scipy</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">get_blas_funcs</span><span class="p">((</span><span class="s1">'nrm2'</span><span class="p">,),</span> <span class="p">(</span><span class="n">a</span><span class="p">,))</span>
</pre></div>
<p>Using this function, no overflow occurs (hurray!)</p>
<div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">95</span><span class="p">]:</span> <span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1e20</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">96</span><span class="p">]:</span> <span class="n">nrm2</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">96</span><span class="p">]:</span> <span class="mf">1.0000000200408773e+20</span>
</pre></div>
<p>and stability is greatly improved</p>
<div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">99</span><span class="p">]:</span> <span class="n">nrm2</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="o">-</span> <span class="mf">1e4</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">99</span><span class="p">]:</span> <span class="mf">0.49998750062513864</span>
</pre></div>
<p><strong>Update</strong>: as of scipy 0.10, this function is used by scipy.linalg.norm .</p>
</div>
<div class="section" id="timing">
<h2>Timing</h2>
<p>Computing the 2-norm of an array is a very cheap operation, so the
computation is usually dominated by external factors, such as memory
access latency or overhead in the Python/C layer. Experimental
benchmarks on an array of size 10^7 show that nrm2 is marginally slower
than $\sqrt{a^T a}$, because the scaling has a cost, but it is also
more stable and less prone to overflow and underflow. They also show
that scipy.linalg.norm is the slowest (and numerically worst!) of all:</p>
<table border="1" class="docutils">
<colgroup>
<col width="38%" />
<col width="25%" />
<col width="37%" />
</colgroup>
<tbody valign="top">
<tr><td>$\sqrt{a^T a}$</td>
<td>BLAS nrm2(a)</td>
<td>scipy.linalg.norm(a)</td>
</tr>
<tr><td>0.02</td>
<td>0.02</td>
<td>0.16</td>
</tr>
</tbody>
</table>
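<p>A rough version of such a benchmark (a sketch assuming scipy is installed; the exact numbers depend on hardware and scipy version, and as noted below modern scipy.linalg.norm already uses nrm2):</p>

```python
import timeit
import numpy as np
from scipy import linalg

a = np.random.rand(10**6).astype(np.float32)
nrm2, = linalg.get_blas_funcs(('nrm2',), (a,))

candidates = [('sqrt(a.T a)', lambda: np.sqrt(np.dot(a, a))),
              ('BLAS nrm2(a)', lambda: nrm2(a)),
              ('scipy.linalg.norm(a)', lambda: linalg.norm(a))]
for name, fn in candidates:
    t = timeit.timeit(fn, number=20) / 20
    print('%-22s %.2e s' % (name, t))
```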
</div>
Smells like hacker spirit2011-02-11T09:50:00+01:002011-02-11T09:50:00+01:00Fabian Pedregosatag:fa.bianp.net,2011-02-11:/blog/2011/smells-like-hacker-spirit/<p>I was last weekend in <a class="reference external" href="http://fosdem.org/2011/">FOSDEM</a> presenting <a class="reference external" href="http://scikit-learn.sf.net">scikits.learn</a> (<a class="reference external" href="http://fa.bianp.net/talks/fosdem-skl/">here are
the slides</a> I used at the Data Analytics Devroom). Kudos to <a class="reference external" href="http://twitter.com/#!/ogrisel">Olivier
Grisel</a> and all the people who organized such a fun and authentic
meeting!</p>
<p><img alt="image0" src="http://farm6.static.flickr.com/5136/5417861859_8480c65eed_m.jpg" /></p>
<p><img alt="image1" src="http://farm6.static.flickr.com/5294/5425114531_6eec316967_m.jpg" /></p>
New examples in scikits.learn 0.62010-12-31T13:55:00+01:002010-12-31T13:55:00+01:00Fabian Pedregosatag:fa.bianp.net,2010-12-31:/blog/2010/new-examples-in-scikitslearn-06/<p>Latest release of <a class="reference external" href="http://scikit-learn.sf.net">scikits.learn</a> comes with an <a class="reference external" href="http://scikit-learn.sourceforge.net/0.6/auto_examples/index.html">awesome collection of
examples</a>. These are some of my favorites:</p>
<div class="section" id="faces-recognition">
<h2>Faces recognition</h2>
<p><a class="reference external" href="http://scikit-learn.sourceforge.net/0.6/auto_examples/applications/plot_face_recognition.html">This example</a> by <a class="reference external" href="http://twitter.com/ogrisel/">Olivier Grisel</a>, downloads a 58MB faces dataset
from <a class="reference external" href="http://vis-www.cs.umass.edu/lfw/">Labeled Faces in the Wild</a>, and is able to perform PCA for
feature extraction and SVC for classification, yielding …</p></div><p>Latest release of <a class="reference external" href="http://scikit-learn.sf.net">scikits.learn</a> comes with an <a class="reference external" href="http://scikit-learn.sourceforge.net/0.6/auto_examples/index.html">awesome collection of
examples</a>. These are some of my favorites:</p>
<div class="section" id="faces-recognition">
<h2>Faces recognition</h2>
<p><a class="reference external" href="http://scikit-learn.sourceforge.net/0.6/auto_examples/applications/plot_face_recognition.html">This example</a>, by <a class="reference external" href="http://twitter.com/ogrisel/">Olivier Grisel</a>, downloads a 58MB faces dataset
from <a class="reference external" href="http://vis-www.cs.umass.edu/lfw/">Labeled Faces in the Wild</a>, then performs PCA for
feature extraction and SVC for classification, yielding a very
acceptable 0.85 f1-score.</p>
<img alt="" src="http://scikit-learn.sourceforge.net/0.6/_images/plot_face_recognition.png" />
</div>
<div class="section" id="species-distribution-modeling">
<h2>Species distribution modeling</h2>
<p><a class="reference external" href="http://scikit-learn.sourceforge.net/0.6/auto_examples/applications/plot_species_distribution_modeling.html">This example</a>, by <a class="reference external" href="http://sites.google.com/site/peterprettenhofer/">Peter Prettenhofer</a>, models the geographical
distribution of two South American mammals given past observations and
14 environmental variables.</p>
<img alt="" src="http://scikit-learn.sourceforge.net/0.6/_images/plot_species_distribution_modeling.png" />
</div>
<div class="section" id="libsvm-gui">
<h2>Libsvm GUI</h2>
<p><a class="reference external" href="http://scikit-learn.sourceforge.net/0.6/auto_examples/applications/svm_gui.html">This other example</a>, again by <a class="reference external" href="http://sites.google.com/site/peterprettenhofer/">Peter Prettenhofer</a> and based on matplotlib
and Tk, lets you draw data points in a canvas and it will interactively
show the decision function of the SVM classifier. See <a class="reference external" href="http://vimeo.com/18308519">this video</a> for
a small showcase (music by <a class="reference external" href="http://www.crepus.com/supercrepus.html">Joe Crepúsculo</a> can be downloaded <a class="reference external" href="http://www.crepus.com/Supercrepus.rar">here</a>)</p>
</div>
Weighted samples for SVMs2010-11-29T13:20:00+01:002010-11-29T13:20:00+01:00Fabian Pedregosatag:fa.bianp.net,2010-11-29:/blog/2010/weighted-samples-for-svms/<p>Based on the work of <a class="reference external" href="http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances">libsvm-dense</a> by Ming-Wei Chang, Hsuan-Tien Lin,
Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu I patched the libsvm
distribution shipped with scikits.learn to allow setting weights for
individual instances. The motivation behind this is to be able to force a
classifier to focus its attention in …</p><p>Based on the work of <a class="reference external" href="http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances">libsvm-dense</a> by Ming-Wei Chang, Hsuan-Tien Lin,
Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu I patched the libsvm
distribution shipped with scikits.learn to allow setting weights for
individual instances. The motivation behind this is to be able to force a
classifier to focus its attention on some samples rather than others.
<a class="reference external" href="http://scikit-learn.sourceforge.net/auto_examples/svm/plot_weighted_samples.html">This example</a> shows how different weights modify the decision
function:</p>
<p><img alt="image0" src="http://lh5.ggpht.com/_IOBIGAGXP4o/TPOK1z_KKNI/AAAAAAAAADQ/DNZCKc4Zt3w/s400/weights1.png" /></p>
<p><img alt="image1" src="http://lh4.ggpht.com/_IOBIGAGXP4o/TPOK2B9kUAI/AAAAAAAAADU/68dOJ6Bm3eY/s400/weights2.png" /></p>
<p><img alt="image2" src="http://lh5.ggpht.com/_IOBIGAGXP4o/TPOK2UXIRlI/AAAAAAAAADY/xKjk2HKHLdc/s400/weights3.png" /></p>
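<p>The same idea survives in today's scikit-learn as the sample_weight argument of fit. A small sketch using the modern sklearn package (not the scikits.learn API this post describes):</p>

```python
import numpy as np
from sklearn import svm

X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [2., 2.], [2., 3.], [3., 2.]])
y = np.array([0, 0, 0, 1, 1, 1])

# Give the last sample ten times the weight of the others: the decision
# boundary is pulled so that misclassifying it becomes more costly.
weights = np.array([1., 1., 1., 1., 1., 10.])

clf = svm.SVC(kernel='linear')
clf.fit(X, y, sample_weight=weights)
print(clf.predict([[0., 0.], [3., 3.]]))  # [0 1]
```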
Coming soon ...2010-11-24T10:39:00+01:002010-11-24T10:39:00+01:00Fabian Pedregosatag:fa.bianp.net,2010-11-24:/blog/2010/coming-soon/<img alt="" src="http://farm5.static.flickr.com/4107/5203822436_41b9c350c2.jpg" />
<p>Highlights for this release: * New <a class="reference external" href="http://scikit-learn.sourceforge.net/modules/sgd.html">stochastic
gradient descent module</a> by <a class="reference external" href="http://sites.google.com/site/peterprettenhofer/">Peter Prettenhofer</a> * Improved svm
module: memory efficiency, automatic class weights. * Wrap for
liblinear's Multi-class SVC (option multi_class in <a class="reference external" href="http://scikit-learn.sourceforge.net/modules/generated/scikits.learn.svm.LinearSVC.html">LinearSVC</a>) * New
features and performance improvements of text feature extraction. *
Improved sparse matrix support, both in main classes (GridSearch) as in
sparse …</p><img alt="" src="http://farm5.static.flickr.com/4107/5203822436_41b9c350c2.jpg" />
<p>Highlights for this release:</p>
<ul>
<li>New <a class="reference external" href="http://scikit-learn.sourceforge.net/modules/sgd.html">stochastic gradient descent module</a> by <a class="reference external" href="http://sites.google.com/site/peterprettenhofer/">Peter Prettenhofer</a></li>
<li>Improved svm module: memory efficiency, automatic class weights.</li>
<li>Wrapper for liblinear's multi-class SVC (option multi_class in <a class="reference external" href="http://scikit-learn.sourceforge.net/modules/generated/scikits.learn.svm.LinearSVC.html">LinearSVC</a>)</li>
<li>New features and performance improvements in text feature extraction.</li>
<li>Improved sparse matrix support, both in main classes (GridSearch) and in the sparse modules scikits.learn.svm.sparse and scikits.learn.glm.sparse.</li>
<li>Lots of cool new examples (<a class="reference external" href="https://github.com/scikit-learn/scikit-learn/blob/master/examples/svm/svm_gui.py">here1</a>, <a class="reference external" href="https://github.com/scikit-learn/scikit-learn/blob/master/examples/plot_species_distribution_modeling.py">here2</a> and <a class="reference external" href="https://github.com/scikit-learn/scikit-learn/blob/master/examples/plot_face_recognition.py">here3</a>)</li>
<li>New Gaussian Process module by <a class="reference external" href="https://github.com/dubourg">Vincent Dubourg</a> (still to be merged)</li>
<li>Faster implementation of the <a class="reference external" href="http://scikit-learn.sourceforge.net/modules/glm.html#lars-algorithm-and-its-variants">LARS algorithm</a>.</li>
<li>Probability estimates for logistic regression.</li>
<li>Lots of bug fixes and documentation improvements.</li>
<li>Probably other things I am forgetting ...</li>
</ul>
memory efficient bindings for libsvm2010-11-19T15:08:00+01:002010-11-19T15:08:00+01:00Fabian Pedregosatag:fa.bianp.net,2010-11-19:/blog/2010/memory-efficient-bindigs-for-libsvm/<p><a class="reference external" href="http://scikit-learn.sf.net">scikits.learn.svm</a> now uses <a class="reference external" href="http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#libsvm_for_dense_data">LibSVM-dense</a> instead of <a class="reference external" href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/">LibSVM</a> for
some support vector machine related algorithms when input is a dense
matrix. As a result most of the copies associated with argument passing
are avoided, giving 50% less memory footprint and several times less
than the python bindings that ship …</p><p><a class="reference external" href="http://scikit-learn.sf.net">scikits.learn.svm</a> now uses <a class="reference external" href="http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#libsvm_for_dense_data">LibSVM-dense</a> instead of <a class="reference external" href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/">LibSVM</a> for
some support vector machine algorithms when the input is a dense
matrix. As a result, most of the copies associated with argument passing
are avoided, giving a 50% smaller memory footprint, and several times
less memory than the Python bindings that ship with libsvm, which store
the data in a very inefficient Python list structure. On the performance
side I didn't see any significant difference, although on large datasets
a smaller memory footprint can make the difference between swapping or not.</p>
solve triangular matrices using scipy.linalg2010-10-30T01:13:00+02:002010-10-30T01:13:00+02:00Fabian Pedregosatag:fa.bianp.net,2010-10-30:/blog/2010/solve-triangular-matrices-using-scipylinalg/<p>For some time now I've been missing a function in scipy that exploits
the triangular structure of a matrix to efficiently solve the associated
system, so I decided to <a class="reference external" href="http://projects.scipy.org/scipy/changeset/6844">implement it</a> by binding the LAPACK method
"trtrs", which also checks for singularities and is capable of handling
several right-hand sides. Contrary …</p><p>For some time now I've been missing
a function in scipy that exploits the triangular structure of a matrix
to efficiently solve the associated system, so I decided to <a class="reference external" href="http://projects.scipy.org/scipy/changeset/6844">implement it</a> by binding the LAPACK method
"trtrs", which also checks for singularities and is capable of handling
several right-hand sides. Contrary to what I expected, binding Fortran
code with f2py is pretty straightforward, even for someone like me who
has never programmed in that language: I took a similar example,
modified its parameters and it worked! Also, thanks to Pauli Virtanen
the review process was really fast and the patch was committed within a
few hours. The high-level interface for LAPACK's trtrs is
linalg.solve_triangular, which accepts roughly the same arguments as
linalg.solve, but assumes the first argument is a triangular matrix:</p>
<div class="highlight"><pre>In [1]: from scipy import linalg
In [2]: linalg.solve_triangular([[1, 1], [0, 1]], [0, 1])
Out[2]: array([-1.,  1.])
</pre></div>
<p>Simple <a class="reference external" href="http://gist.github.com/654407">benchmarks</a> let us clearly appreciate the
complexity gap between both methods: solving an (n, n) triangular
system is an O(n^2) operation, while solving a full one is at least
O(n^3):</p>
<img alt="" src="http://lh3.ggpht.com/_IOBIGAGXP4o/TMs3PvgFIwI/AAAAAAAAABA/ImOSqSZmljA/s400/works.png" />
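<p>The O(n^2) cost is easy to see from the algorithm itself: each unknown needs one dot product with the already-computed part of the solution. A back-of-the-envelope sketch of forward substitution (the idea trtrs exploits, minus its scaling and singularity checks):</p>

```python
import numpy as np

def forward_substitution(L, b):
    # Solve L x = b for lower-triangular L: row i costs O(i) operations,
    # so the total is O(n^2) instead of the O(n^3) of a general solve.
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - np.dot(L[i, :i], x[:i])) / L[i, i]
    return x

L = np.array([[2., 0.], [1., 3.]])
b = np.array([2., 5.])
print(forward_substitution(L, b))  # x = [1, 4/3]
```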
LARS algorithm2010-09-30T16:01:00+02:002010-09-30T16:01:00+02:00Fabian Pedregosatag:fa.bianp.net,2010-09-30:/blog/2010/lars-algorithm/<p>I've been working lately with <a class="reference external" href="http://www-sop.inria.fr/members/Alexandre.Gramfort/">Alexandre Gramfort</a> coding the <a class="reference external" href="http://scikit-learn.sf.net/modules/glm.html#lars-algorithm-and-its-variants">LARS
algorithm</a> in <a class="reference external" href="http://scikit-learn.sf.net">scikits.learn</a>. This algorithm computes the solution to
several general linear models used in machine learning: LAR, Lasso,
Elasticnet and Forward Stagewise. Unlike the implementation by
coordinate descent, the LARS algorithm gives the full coefficient path
along the …</p><p>I've been working lately with <a class="reference external" href="http://www-sop.inria.fr/members/Alexandre.Gramfort/">Alexandre Gramfort</a> coding the <a class="reference external" href="http://scikit-learn.sf.net/modules/glm.html#lars-algorithm-and-its-variants">LARS
algorithm</a> in <a class="reference external" href="http://scikit-learn.sf.net">scikits.learn</a>. This algorithm computes the solution to
several general linear models used in machine learning: LAR, Lasso,
Elasticnet and Forward Stagewise. Unlike the implementation by
coordinate descent, the LARS algorithm gives the full coefficient path
along the regularization parameter, and thus it is especially well suited
for performing model selection.</p>
<img alt="" src="http://scikit-learn.sourceforge.net/_images/plot_lasso_lars.png" />
<p>The algorithm is coded
mostly in python, with some tiny parts in C (because I already had the
code for Cholesky deletes in C) and a cython interface for the BLAS
function dtrsv, which will be proposed to scipy once I stabilize this
code. The algorithm is mostly complete, allowing some optimizations,
like using a precomputed Gram matrix or specifying a maximum number of
features/iterations, but could still be extended to compute other
models, like ElasticNet or Forward Stagewise. I haven't done any
benchmarks yet, but preliminary ones by Alexandre Gramfort showed that
it is roughly equivalent to this <a class="reference external" href="http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=3897">Matlab implementation</a>. Using
<a class="reference external" href="http://pymvpa.org">PyMVPA</a>, it shouldn't be difficult to benchmark it against the R
implementation, though.</p>
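<p>With today's scikit-learn the full path described above can be obtained through lars_path. A sketch using the modern sklearn package (not the scikits.learn code this post describes), on synthetic data:</p>

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.RandomState(0)
X = rng.randn(60, 5)
# Only the first feature actually matters.
y = 3.0 * X[:, 0] + 0.1 * rng.randn(60)

# coefs has one column per breakpoint of the piecewise-linear path,
# so the whole regularization path comes out of a single call.
alphas, active, coefs = lars_path(X, y, method='lasso')
print(coefs.shape)  # (5, number of path breakpoints)
```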
Second scikits.learn coding sprint2010-09-12T22:31:00+02:002010-09-12T22:31:00+02:00Fabian Pedregosatag:fa.bianp.net,2010-09-12:/blog/2010/second-scikitslearn-coding-sprint/<p>Last week the second <a class="reference external" href="http://scikit-learn.sf.net">scikits.learn</a> sprint took place in Paris. It was
two days of insane activity (115 commits, 6 branches, 33 coffees) in
which we did a lot of work, both implementing new algorithms and fixing
or improving old ones. This includes: * sparse version of Lasso by
coordinate …</p><p>Last week the second <a class="reference external" href="http://scikit-learn.sf.net">scikits.learn</a> sprint took place in Paris. It was
two days of insane activity (115 commits, 6 branches, 33 coffees) in
which we did a lot of work, both implementing new algorithms and fixing
or improving old ones. This includes:</p>
<ul>
<li>sparse version of Lasso by coordinate descent. Not (yet) merged into master, but can be looked at in <a class="reference external" href="http://github.com/ogrisel/scikit-learn/tree/issue-77-sparse-cd">Olivier's branch</a>.</li>
<li>new API for Pipeline. An example of this can be found in the document <a class="reference external" href="http://scikit-learn.sourceforge.net/auto_examples/svm/plot_svm_anova.html">SVM-Anova: SVM with univariate feature selection</a>.</li>
<li>documentation for the <a class="reference external" href="http://scikit-learn.sourceforge.net/modules/glm.html#bayesian-regression">bayesian methods</a> and <a class="reference external" href="http://scikit-learn.sourceforge.net/cross_validation.html">cross validation</a>: Vincent Michel contributed a lot of documentation, mainly taken from chapters of his thesis.</li>
<li><a class="reference external" href="http://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/covariance/ledoit_wolf.py">Ledoit-Wolf covariance estimation</a>.</li>
<li>Pure python <a class="reference external" href="http://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/fastica.py">Fast ICA</a> implementation.</li>
</ul>
<p>And the family picture, featuring (from left to right): <a class="reference external" href="http://www-sop.inria.fr/members/Alexandre.Gramfort/index.fr.html">Alexandre Gramfort</a>,
<a class="reference external" href="http://parietal.saclay.inria.fr/Members/bertrand-thirion">Bertrand Thirion</a>, <a class="reference external" href="http://parietal.saclay.inria.fr/Members/virgile-fritsch">Virginie Fritsch</a>, <a class="reference external" href="http://gael-varoquaux.info/">Gael Varoquaux</a>, <a class="reference external" href="http://parietal.saclay.inria.fr/Members/vincent-michel">Vincent
Michel</a>, <a class="reference external" href="http://github.com/ogrisel">Olivier Grisel</a> and me (taking the picture).</p>
<img alt="" src="http://farm5.static.flickr.com/4135/4974339970_566424185f.jpg" />
Support for sparse matrices in scikits.learn2010-08-23T17:47:00+02:002010-08-23T17:47:00+02:00Fabian Pedregosatag:fa.bianp.net,2010-08-23:/blog/2010/support-for-sparse-matrices-in-scikitslearn/<p>I recently added support for sparse matrices (as defined in
scipy.sparse) in some classifiers of <a class="reference external" href="http://scikit-learn.sf.net">scikits.learn</a>. In those classes,
the fit method will perform the algorithm without converting to a dense
representation and will also store parameters in an efficient format.
Right now, the only classes that implement …</p><p>I recently added support for sparse matrices (as defined in
scipy.sparse) in some classifiers of <a class="reference external" href="http://scikit-learn.sf.net">scikits.learn</a>. In those classes,
the fit method will perform the algorithm without converting to a dense
representation and will also store parameters in an efficient format.
Right now, the only classes that implement this are SVC and LinearSVC
in scikits.learn.svm.sparse, although the plan is to add more classes in
the future. These are capable of taking sparse matrices in the fit()
method and will also store support vectors as sparse matrices. Here is
an example. We first create a toy dataset and import relevant modules:</p>
<div class="highlight"><pre>In [1]: import scipy.sparse
In [2]: from scikits.learn import svm
In [3]: X, Y = scipy.sparse.csr_matrix([[0, 0], [0, 1]]), [0, 1]
In [4]: clf = svm.sparse.SVC(kernel='linear')
</pre></div>
<p>Now we fit the model and query some of its parameters:</p>
<div class="highlight"><pre>In [5]: clf.fit(X, Y)
Out[5]: SVC(kernel='linear', C=1.0, probability=0, shrinking=1, eps=0.001,
    cache_size=100.0, coef0=0.0, gamma=0.0)
In [6]: clf.support_
Out[6]: <2x2 sparse matrix of type ''
    with 1 stored elements in Compressed Sparse Row format>
In [7]: clf.coef_
Out[7]: <1x2 sparse matrix of type ''
    with 1 stored elements in Compressed Sparse Row format>
</pre></div>
<p>For a more complete example, you can look at Classification of text
documents using sparse features, contributed by Olivier Grisel.</p>
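<p>The memory argument behind all this is easy to see with scipy.sparse alone (a sketch, independent of the scikits.learn classes above; sizes are illustrative):</p>

```python
import numpy as np
import scipy.sparse

# A 1000 x 1000 matrix with roughly 0.1% non-zero entries.
dense = np.zeros((1000, 1000), dtype=np.float64)
rng = np.random.RandomState(0)
rows = rng.randint(0, 1000, size=1000)
cols = rng.randint(0, 1000, size=1000)
dense[rows, cols] = 1.0

sparse = scipy.sparse.csr_matrix(dense)
print(dense.nbytes)  # 8000000 bytes for the dense representation
# CSR stores only the non-zeros plus index arrays: orders of magnitude less.
print(sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)
```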
Flags to debug python C extensions.2010-08-18T13:40:00+02:002010-08-18T13:40:00+02:00Fabian Pedregosatag:fa.bianp.net,2010-08-18:/blog/2010/flags-to-debug-python-c-extensions/<p>I often find myself debugging python C extensions from gdb, but usually
some variables are hidden because aggressive optimizations that
distutils sets by default. What I did not know, is that you can prevent
those optimizations by passing flags -O0 -fno-inline to gcc in keyword
extra_compile_args (note: this will only …</p><p>I often find myself debugging python C extensions from gdb, but usually
some variables are hidden because of the aggressive optimizations that
distutils sets by default. What I did not know is that you can prevent
those optimizations by passing the flags -O0 -fno-inline to gcc via the
extra_compile_args keyword (note: these flags only work with GCC). A
complete example would look like:</p>
<div class="highlight"><pre><span></span><span class="n">config</span><span class="o">.</span><span class="n">add_extension</span><span class="p">(</span><span class="s1">'foo'</span><span class="p">,</span>
<span class="n">sources</span><span class="o">=</span><span class="p">[</span><span class="s1">'a.c'</span><span class="p">],</span> <span class="c1"># add this for gdb debug</span>
<span class="n">extra_compile_args</span><span class="o">=</span><span class="p">[</span><span class="s1">'-O0'</span><span class="p">,</span> <span class="s1">'-fno-inline'</span><span class="p">])</span>
</pre></div>
<p>and it becomes much easier to debug from gdb.</p>
July in Paris2010-07-30T00:11:00+02:002010-07-30T00:11:00+02:00Fabian Pedregosatag:fa.bianp.net,2010-07-30:/blog/2010/july-in-paris/<p>One of the best things about spending summer in Paris: its parks (here,
with friends @ Parc Montsouris).</p>
<p><img alt="image0" src="http://farm5.static.flickr.com/4103/4842146900_953f961d64.jpg" /></p>
Support Vector machines with custom kernels using scikits.learn2010-05-27T10:42:00+02:002010-05-27T10:42:00+02:00Fabian Pedregosatag:fa.bianp.net,2010-05-27:/blog/2010/support-vector-machines-with-custom-kernels-using-scikitslearn/<p>It is now possible (using the development version as of may 2010) to use
Support Vector Machines with custom kernels in scikits.learn. How to use
it couldn't be more simple: you just pass a callable (the kernel) to the
class constructor). For example, a linear kernel would be implemented …</p><p>It is now possible (using the development version as of may 2010) to use
Support Vector Machines with custom kernels in scikits.learn. Using one
couldn't be simpler: you just pass a callable (the kernel) to the
class constructor. For example, a linear kernel would be implemented as
follows:</p>
<div class="highlight"><pre>import numpy as np

def my_kernel(x, y):
    return np.dot(x, y.T)
</pre></div>
<p>The only requirement for a kernel is that it take two numpy arrays as
arguments and also return a numpy array. You then pass the kernel to the
classifier's constructor:</p>
<div class="highlight"><pre>from scikits.learn import svm

clf = svm.SVC(kernel=my_kernel)
</pre></div>
<p>and that's all: the constructor recognizes this as a custom kernel and
you can then use the classifier as any other classifier:</p>
<div class="highlight"><pre>clf.fit([[0, 0], [1, 1]], [0, 1])
print clf.predict([[0, 0]])  --> [0.]
</pre></div>
<p>For a complete
reference, see <a class="reference external" href="http://scikit-learn.sourceforge.net/modules/svm.html#using-custom-kernels">the reference manual</a> and <a class="reference external" href="http://scikit-learn.sourceforge.net/auto_examples/svm/plot_custom_kernel.html#example-svm-plot-custom-kernel-py">an example</a>.</p>
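<p>Any callable with the same convention works. For example, a hand-rolled Gaussian (RBF) kernel might look like this (a sketch; my_rbf_kernel and its gamma parameter are illustrative, not part of any API):</p>

```python
import numpy as np

def my_rbf_kernel(x, y, gamma=0.5):
    # Gram matrix K[i, j] = exp(-gamma * ||x_i - y_j||^2), built with
    # broadcasting: two 2-d arrays in, one 2-d array out.
    x, y = np.atleast_2d(x), np.atleast_2d(y)
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

K = my_rbf_kernel(np.eye(2), np.eye(2))
print(K[0, 0], K[0, 1])  # 1.0 and exp(-1) ~ 0.368
```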
Howto link against system-wide BLAS library using numpy.distutils2010-04-22T14:28:00+02:002010-04-22T14:28:00+02:00Fabian Pedregosatag:fa.bianp.net,2010-04-22:/blog/2010/howto-link-against-system-wide-blas-library-using-numpydistutils/<p>If your numpy installation uses system-wide BLAS libraries (this will
most likely be the case unless you installed it through prebuilt windows
binaries), you can retrieve this information at compile time to link
python modules to BLAS. The function get_info in
numpy.distutils.system_info will return a dictionary that contains …</p><p>If your numpy installation uses system-wide BLAS libraries (this will
most likely be the case unless you installed it through prebuilt Windows
binaries), you can retrieve this information at compile time to link
python modules to BLAS. The function get_info in
numpy.distutils.system_info returns a dictionary that contains the
information needed to link against BLAS, or an empty dict if no
system-wide BLAS could be found. For example, Mac OS X ships with its
own optimized BLAS routines, and get_info correctly reports that:</p>
<div class="highlight"><pre>In [1]: from numpy.distutils.system_info import get_info
In [2]: get_info('blas_opt')
Out[2]:
{'define_macros': [('NO_ATLAS_INFO', 3)],
 'extra_compile_args': ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers'],
 'extra_link_args': ['-Wl,-framework', '-Wl,Accelerate']}
</pre></div>
<p>The following example shows a setup.py that links against the
system-wide BLAS if possible. If no appropriate BLAS routines can be
found, it prints a warning message, but compiles its own BLAS routines
and embeds them in the python extension.</p>
<div class="highlight"><pre>from os.path import join

def configuration(parent_package='', top_path=None):
    import warnings
    from numpy.distutils.misc_util import Configuration
    from numpy.distutils.system_info import get_info, BlasNotFoundError

    config = Configuration('foo', parent_package, top_path)
    libfoo_files = ['foo.c']
    blas_sources = [join('blas', 'daxpy.c'), join('blas', 'dscal.c')]

    blas_info = get_info('blas_opt')
    if not blas_info:
        warnings.warn(BlasNotFoundError.__doc__)
        libfoo_files.extend(blas_sources)

    libraries = blas_info.pop('libraries', [])
    include_dirs = blas_info.pop('include_dirs', [])
    config.add_extension('foo',
                         sources=libfoo_files,
                         libraries=libraries,
                         include_dirs=include_dirs,
                         **blas_info)
    return config

if __name__ == '__main__':
    from numpy.distutils.core import setup
    setup(**configuration(top_path='').todict())
</pre></div>
<p>A real-world
example of this can be found in the scipy.odr module and in
<a class="reference external" href="http://scikit-learn.sourceforge.net">scikits.learn's</a> liblinear bindings.</p>
scikits.learn 0.2 release2010-03-22T11:37:00+01:002010-03-22T11:37:00+01:00Fabian Pedregosatag:fa.bianp.net,2010-03-22:/blog/2010/scikitslearn-02-release/<p>Today I released a new version of the <a class="reference external" href="http://scikit-learn.sourceforge.net">scikits.learn</a> library for
machine learning. This new release includes the new libsvm bindings,
Jake VanderPlas' BallTree algorithm for *fast* nearest neighbor
queries in high dimension, etc. <a class="reference external" href="http://sourceforge.net/mailarchive/message.php?msg_name=4BA72BE3.1010208%40inria.fr">Here</a> is the official announcement. As
usual, it can be downloaded from <a class="reference external" href="http://sourceforge.net/projects/scikit-learn/files">sourceforge</a> or from …</p><p>Today I released a new version of the <a class="reference external" href="http://scikit-learn.sourceforge.net">scikits.learn</a> library for
machine learning. This new release includes the new libsvm bindings,
Jake VanderPlas' BallTree algorithm for *fast* nearest neighbor
queries in high dimension, etc. <a class="reference external" href="http://sourceforge.net/mailarchive/message.php?msg_name=4BA72BE3.1010208%40inria.fr">Here</a> is the official announcement. As
usual, it can be downloaded from <a class="reference external" href="http://sourceforge.net/projects/scikit-learn/files">sourceforge</a> or from <a class="reference external" href="http://pypi.python.org/pypi/scikits.learn">the PyPI</a>.</p>
Plot the maximum margin hyperplane with scikits.learn2010-03-17T12:24:00+01:002010-03-17T12:24:00+01:00Fabian Pedregosatag:fa.bianp.net,2010-03-17:/blog/2010/plot-the-maximum-margin-hyperplane-with-scikitslearn/<p>Suppose some given data points each belong to one of two classes, and
the goal is to decide which class a new data point will be in. In the
case of support vector machines, a data point is viewed as a
p-dimensional vector (2-dimensional in this example), and we want …</p><p>Suppose some given data points each belong to one of two classes, and
the goal is to decide which class a new data point will be in. In the
case of support vector machines, a data point is viewed as a
p-dimensional vector (2-dimensional in this example), and we want to
know whether we can separate such points with a (p-1)-dimensional
hyperplane (a line in our case). There are many hyperplanes that might
classify the data. One reasonable choice as the best hyperplane is the
one that represents the largest separation, or margin, between the two
classes. So we choose the hyperplane so that the distance from it to the
nearest data point on each side is maximized. If such a hyperplane
exists, it is known as the maximum-margin hyperplane and the linear
classifier it defines is known as a maximum margin classifier. Using the
new svm module in <a class="reference external" href="http://scikit-learn.sourceforge.net">scikits.learn</a>, you can easily plot the maximum
margin hyperplane. If clf is an instance of svm.SVC(), the coefficients
in the decision function are stored in clf.coef_, and the independent
term in clf.rho_. The complete source code is:
[cc lang="python"]
import numpy as np
import pylab as pl
from scikits.learn import svm

# we create 40 separable points
np.random.seed(0)
X = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
Y = [0]*20 + [1]*20

# fit the model
clf = svm.SVC(kernel='linear')
clf.fit(X, Y)

# get the separating hyperplane
w = np.dot(clf.coef_[0], clf.support_)
a = -w[0]/w[1]
xx = np.linspace(-5, 5)
yy = a*xx + (clf.rho_[0])/w[1]

# plot the parallels to the separating hyperplane that pass through the
# support vectors
b = clf.support_[0]
yy_down = a*xx + (b[1] - a*b[0])
b = clf.support_[-1]
yy_up = a*xx + (b[1] - a*b[0])

# plot the line, the points, and the nearest vectors to the plane
pl.set_cmap(pl.cm.Paired)
pl.plot(xx, yy, 'k-')
pl.plot(xx, yy_down, 'k--')
pl.plot(xx, yy_up, 'k--')
pl.scatter(X[:, 0], X[:, 1], c=Y)
pl.scatter(clf.support_[:, 0], clf.support_[:, 1], marker='+')
pl.axis('tight')
pl.show()
[/cc]
And the result is</p>
<img alt="" src="http://farm5.static.flickr.com/4030/4442992244_6edb72a83a.jpg" />
<p>where the vectors that are closest to the separating line
are highlighted with a small '+'. Up-to date code of this can be found
in directory examples/ of scikits.learn</p>
<img alt="" src="http://farm5.static.flickr.com/4030/4442992244_6edb72a83a.jpg" />
Fast bindings for LibSVM in scikits.learn2010-03-09T15:49:00+01:002010-03-09T15:49:00+01:00Fabian Pedregosatag:fa.bianp.net,2010-03-09:/blog/2010/fast-bindings-for-libsvm-in-scikitslearn/<p><a class="reference external" href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/">LibSVM</a> is a C++ library that implements several Support Vector
Machine algorithms that are commonly used in machine learning. It is a
fast library that has no dependencies and most machine learning
frameworks bind it in some way or another. LibSVM comes with a Python
interface written in swig, but …</p><p><a class="reference external" href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/">LibSVM</a> is a C++ library that implements several Support Vector
Machine algorithms that are commonly used in machine learning. It is a
fast library that has no dependencies and most machine learning
frameworks bind it in some way or another. LibSVM comes with a Python
interface written in swig, but this interface is inherently slow as it
does not take into account numpy's array structure. Also, it does not
wrap all the library's functionality. Some projects use these
bindings, while others (such as PyMVPA) write their own wrappers, binding some
methods directly to numpy's array structure. My approach was to code
<a class="reference external" href="http://github.com/fseoane/scikit-learn/blob/master/scikits/learn/src/libsvm_helper.c">all algorithms that convert</a> libsvm's data structures (sparse) to
numpy arrays (dense) in pure C and wrap them in a very thin Cython
layer. Special attention was given to minimize the overhead of
converting between libsvm data structures and numpy arrays, as in my
opinion this was the main source of bad performance in existing python
bindings.</p>
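<p>The dense-to-sparse conversion whose overhead the thin Cython layer tries to avoid can be pictured with a short sketch. This is a hypothetical pure-Python illustration of libsvm's sparse node layout (1-based indices, sentinel index -1), not the actual C helper:</p>

```python
def dense_to_libsvm_nodes(row):
    """Convert one dense feature row into libsvm-style sparse nodes:
    (1-based index, value) pairs, terminated by the sentinel index -1."""
    nodes = [(i + 1, v) for i, v in enumerate(row) if v != 0.0]
    nodes.append((-1, 0.0))  # libsvm's end-of-row marker
    return nodes

print(dense_to_libsvm_nodes([0.0, 3.5, 0.0, 1.0]))
# [(2, 3.5), (4, 1.0), (-1, 0.0)]
```

<p>Looping element by element in Python like this is exactly the kind of overhead that moving the conversion into C removes.</p>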
<div class="section" id="benchmarks">
<h2>Benchmarks</h2>
<p>As a first benchmark, I supposed a situation in which the dimension of
the subspace is small and there are lots of points to classify. This is
typically the case when your data is points in plane or in space and you
want to draw the decision function by classifying every point in the
grid. In this case, the bottleneck is not the classification algorithm,
but the conversion of data between the dense representation used by Python
and numpy and the sparse representation used by libsvm. Not surprisingly,
we get huge performance gains if we speed up the conversion
dense/sparse.</p>
<img alt="" src="http://farm3.static.flickr.com/2745/4419953422_068e443a75.jpg" />
</div>
<div class="section" id="curse-of-dimensionality">
<h2>Curse of dimensionality</h2>
<p>In the case of a huge number of dimensions, the speedup is not so
spectacular, but we also get a performance boost by making training
somewhat faster.</p>
<img alt="" src="http://farm3.static.flickr.com/2696/4419195551_cee4aed9cf.jpg" />
</div>
<div class="section" id="bidirectional-mapping">
<h2>Bidirectional mapping</h2>
<p>A feature that was needed and that I haven't found on other
implementations is that you can tweak parameters in the SVM class and
the classifier will reflect those changes (i.e. parameters are actually
copied back and forth, not just passed as an opaque pointer). Suppose
you train an instance of the classifier and are interested in the
coefficients that multiply the support vectors in the decision function.
In scikits.learn, you can access this array under the field .coef_:
[cc lang="python"]
>>> import numpy as np
>>> from scikits.learn import svm
>>> clf = svm.SVM()
>>> clf.fit([[1, 2], [3, 4]], [-1, 1])
>>> clf.coef_
array([[ 1., -1.]])
[/cc]
Now, changing the value of these coefficients effectively changes the
decision function:
[cc lang="python"]
>>> clf.predict([[1, 2]])
array([ -1.])
>>> clf.coef_ = np.array([[0.0, -1.0]])
>>> clf.predict([[1, 2]])
array([ 1.])
[/cc]</p>
</div>
<div class="section" id="code">
<h2>Code</h2>
<p>All code can be found in the <a class="reference external" href="http://scikit-learn.sf.net">scikit</a> (you'll have to get the svn
version), in file scikits/learn/svm.py and scikits/learn/src/. All plots
are generated from <a class="reference external" href="http://github.com/fseoane/scikit-learn/tree/master/scikits/learn/benchmarks/bench_svm.py">this script</a></p>
</div>
<div class="section" id="notes">
<h2>Notes</h2>
<p>In the benchmarks, a Linear Kernel was used, as it is the most common.
Other more computationally intensive kernels would probably narrow the
difference.</p>
</div>
<div class="section" id="bugs">
<h2>Bugs</h2>
<p>This code should be treated as alpha quality and has not been
extensively tested. Please report any bugs that you encounter to <a class="reference external" href="https://sourceforge.net/apps/trac/scikit-learn/report">the
tracker</a></p>
</div>
scikits.learn coding sprint in Paris2010-03-04T11:25:00+01:002010-03-04T11:25:00+01:00Fabian Pedregosatag:fa.bianp.net,2010-03-04:/blog/2010/scikitslearn-coding-sprint-in-paris/<p>Yesterday we had an extremely productive coding sprint for the
<a class="reference external" href="http://scikit-learn.sf.net">scikits.learn</a>. The idea was to put people with common interests in a
room and make them work in a single codebase. Alexandre Gramfort and
Olivier Grisel worked on <a class="reference external" href="http://scikit-learn.svn.sourceforge.net/viewvc/scikit-learn/trunk/scikits/learn/glm/">GLMNet</a>, Bertrand Thirion and Gaël Varoquaux
worked on <a class="reference external" href="http://scikit-learn.svn.sourceforge.net/viewvc/scikit-learn/trunk/scikits/learn/feature_selection/">univariate feature selection …</a></p><p>Yesterday we had an extremely productive coding sprint for the
<a class="reference external" href="http://scikit-learn.sf.net">scikits.learn</a>. The idea was to put people with common interests in a
room and make them work in a single codebase. Alexandre Gramfort and
Olivier Grisel worked on <a class="reference external" href="http://scikit-learn.svn.sourceforge.net/viewvc/scikit-learn/trunk/scikits/learn/glm/">GLMNet</a>, Bertrand Thirion and Gaël Varoquaux
worked on <a class="reference external" href="http://scikit-learn.svn.sourceforge.net/viewvc/scikit-learn/trunk/scikits/learn/feature_selection/">univariate feature selection</a> and Vincent worked on
<a class="reference external" href="http://scikit-learn.svn.sourceforge.net/viewvc/scikit-learn/trunk/scikits/learn/bayes/">Bayesian Regression</a>.</p>
<img alt="" src="http://farm5.static.flickr.com/4067/4405351641_5675ba000c.jpg" />
<p>I was supposed to work
with Vincent, but as soon as Bertrand spotted some bugs in my libsvm
bindings, I could not think of anything except that, and eventually the
day finished just as I fixed the bug ... You can find some cool examples
of the things we did in directory <a class="reference external" href="http://scikit-learn.svn.sourceforge.net/viewvc/scikit-learn/trunk/examples/">examples</a>:</p>
<img alt="" src="http://farm5.static.flickr.com/4069/4405398285_ec2e4bf805.jpg" />
<img alt="" src="http://farm3.static.flickr.com/2707/4405404339_550bc56071.jpg" />
Scikit-learn 0.12010-02-01T15:32:00+01:002010-02-01T15:32:00+01:00Fabian Pedregosatag:fa.bianp.net,2010-02-01:/blog/2010/scikit-learn-01/<p>Today I released the first public version of <a class="reference external" href="http://scikit-learn.sourceforge.net">Scikit-Learn</a> (<a class="reference external" href="https://sourceforge.net/mailarchive/message.php?msg_name=4B66D190.5090100%40inria.fr">release
notes</a>). It's a python module implementing some machine learning
algorithms, and it's shaping up quite well.</p>
<p>For this release I did not want to do any incompatible changes, so most of them are just bug fixes and
updates. For the next …</p><p>Today I released the first public version of <a class="reference external" href="http://scikit-learn.sourceforge.net">Scikit-Learn</a> (<a class="reference external" href="https://sourceforge.net/mailarchive/message.php?msg_name=4B66D190.5090100%40inria.fr">release
notes</a>). It's a python module implementing some machine learning
algorithms, and it's shaping up quite well.</p>
<p>For this release I did not want to do any incompatible changes, so most of them are just bug fixes and
updates. For the next release, however, some more radical changes are
planned, and definitely something should be done about the (incredibly
long) namespace; having to type
<tt class="docutils literal">from scikits.learn.machine.manifold_learning.regression.neighbors import Neighbors</tt>
each time you want to run a nearest-neighbor algorithm is just not
practical! Here is a nice screenshot,</p>
<img alt="" src="http://farm5.static.flickr.com/4006/4321591663_4134e3a095.jpg" />
scikit-learn project on sourceforge2010-01-07T15:17:00+01:002010-01-07T15:17:00+01:00Fabian Pedregosatag:fa.bianp.net,2010-01-07:/blog/2010/scikit-learn-project-on-sourceforge/<p>This week we created a <a class="reference external" href="https://sourceforge.net/projects/scikit-learn/">sourceforge project</a> to host our development of
scikit-learn. Although the project already had a directory in scipy's
repo, we needed more flexibility in the user management and in the
mailing list creation, so we opted for SourceForge. To be honest, after
using git and Google …</p><p>This week we created a <a class="reference external" href="https://sourceforge.net/projects/scikit-learn/">sourceforge project</a> to host our development of
scikit-learn. Although the project already had a directory in scipy's
repo, we needed more flexibility in the user management and in the
mailing list creation, so we opted for SourceForge. To be honest, after
using git and Google Code for bug tracking, I was not very excited about
using subversion/sourceforge again. On the other hand, we needed some
sort of compromise that would allow a very heterogeneous range of
developers to work together, and after some (surprisingly civilized)
emails and some chatting with Gael, we agreed that SourceForge was
indeed the best choice. In case you are interested, there's a
(preliminary) <a class="reference external" href="http://scikit-learn.sourceforge.net/">web page</a> with more info. You might also want to have a
look at the previous project's <a class="reference external" href="http://www.scipy.org/scipy/scikits/wiki/MachineLearning">scipy web page</a>.</p>
After holidays2010-01-05T10:56:00+01:002010-01-05T10:56:00+01:00Fabian Pedregosatag:fa.bianp.net,2010-01-05:/blog/2010/after-holidays/<p>New job, new code, new city, new colleagues. Feels something like this:</p>
<img alt="" src="http://farm5.static.flickr.com/4027/4240407852_6f461f3776_m.jpg" />
Winter in Paris is not funny2009-12-22T19:36:00+01:002009-12-22T19:36:00+01:00Fabian Pedregosatag:fa.bianp.net,2009-12-22:/blog/2009/winter-in-paris-is-not-funny/<p>This week I arrived at the place where I will be working for the following
two years: Neurospin.</p>
<img alt="" src="http://farm5.static.flickr.com/4042/4206517312_e35b7fa55d_m.jpg" />
<p>It's a research center located 20
km from Paris, and so far things are going smoothly: the place is
beautiful, work is great and food is excellent. Well OK, I do miss some …</p><p>This week I arrived at the place where I will be working for the following
two years: Neurospin.</p>
<img alt="" src="http://farm5.static.flickr.com/4042/4206517312_e35b7fa55d_m.jpg" />
<p>It's a research center located 20
km from Paris, and so far things are going smoothly: the place is
beautiful, work is great and food is excellent. Well OK, I do miss some
things from Spain and weather is horrible, but from now on it can only
get better, I suppose.</p>
Last days in Granada2009-12-15T23:42:00+01:002009-12-15T23:42:00+01:00Fabian Pedregosatag:fa.bianp.net,2009-12-15:/blog/2009/last-days-in-granada/<p>Nice thing about winter in Granada is, that even in the coldest days,
the sky is always blue.</p>
<img alt="" src="http://farm3.static.flickr.com/2539/4180997669_aa5a45a949_m.jpg" />
Learning, Machine Learning2009-12-15T23:34:00+01:002009-12-15T23:34:00+01:00Fabian Pedregosatag:fa.bianp.net,2009-12-15:/blog/2009/learning-machine-learning/<p>My new job is about managing an open source package for machine learning
in Python. I've had some experience with Python now, but I am a total
newbie in the field of machine learning, so my first task will be to
find a good reference book in the subject and …</p><p>My new job is about managing an open source package for machine learning
in Python. I've had some experience with Python now, but I am a total
newbie in the field of machine learning, so my first task will be to
find a good reference book in the subject and start reading. The books
I've come across so far have been: The classic "Artificial Intelligence,
A Modern Approach" by Rusell & Norvig, also known by its initials AIMA
has been a very valuable introduction. It has four chapters devoted to
Machine Learning, and while it does not go very deeply into the maths
nor provide detailed analysis of the algorithm, it can be read fairly
easily and has very interesting historical notes at the end of each
chapter. "Data Mining, Practical Machine Learning Tools and Techniques"
by Ian H. Witten and Eibe Frank. It is easy going, a bit more in-depth
but does not go very deep into the maths (and when it does, the section
is marked as 'optional'). The second part of the book, called "The Weka
Machine Learning Workbench" describes Weka, a machine learning framework
written in Java. For me, this is an invaluable resource, as Weka seems
to be fairly well designed and will surely provide some design
inspiration. "Pattern Classification", by Richard O Duda, Peter E Hart,
David G Stork. A bit harsh at the beginning, this book is slowly
becoming my favorite. Doesn't hide the maths behind the algorithms, has
complete sections on computational complexity, lots of exercises at the
end of each chapter ... I just wish it came with answers to (selected)
exercises. Interestingly, this book starts by describing the most specific
methods and finishes with the most general ones. In AIMA, you'll find
the reverse scheme.</p>
Moving to Paris!2009-12-12T02:07:00+01:002009-12-12T02:07:00+01:00Fabian Pedregosatag:fa.bianp.net,2009-12-12:/blog/2009/moving-to-paris/<p>I'm extremely glad that finally I am moving to Paris to work as part of
the INRIA crew. I'll be working with <a class="reference external" href="http://gael-varoquaux.info/">Gael Varoquaux</a> and his team in
an extremely cool Python related project (more to come on this in the
following weeks). Granada has been a great place for …</p><p>I'm extremely glad that finally I am moving to Paris to work as part of
the INRIA crew. I'll be working with <a class="reference external" href="http://gael-varoquaux.info/">Gael Varoquaux</a> and his team in
an extremely cool Python related project (more to come on this in the
following weeks). Granada has been a great place for me for the last 10
years, but now I feel it's time to move, and hey, Paris is not a bad
place :-)</p>
Summer of Code is over2009-09-05T12:21:00+02:002009-09-05T12:21:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-09-05:/blog/2009/summer-of-code-is-over/<p>Google Summer of Code program is officially over. It has been four
months of intense work, exciting benchmarks and patch reviewing. It was
a huge pleasure working with you guys! As for the project, I implemented
a complete logic module and then an assumption system for sympy
(sympy.logic, sympy …</p><p>Google Summer of Code program is officially over. It has been four
months of intense work, exciting benchmarks and patch reviewing. It was
a huge pleasure working with you guys! As for the project, I implemented
a complete logic module and then an assumption system for sympy
(sympy.logic, sympy.assumptions, sympy.queries). I even had time to make
the logic module fast. On top of this, there's the refine module. It is
there where you can see some nice examples and where all the power of
sympy.queries and sympy.logic is exposed. Although this sounds good,
there are some things that I did not complete on time. I could not
remove the old assumption system. There are simply too many things that
depend on it to remove it in one move. However, I agreed with Ondrej
that we will both work on this during September 15-30. This has
to be done because we definitely do not want to make a sympy release
with two different assumption systems! PS: a more detailed report lives
<a class="reference external" href="http://code.google.com/p/sympy/wiki/AssumptionsReport">here</a></p>
Speed improvements for ask() (sympy.queries.ask)2009-08-20T00:36:00+02:002009-08-20T00:36:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-08-20:/blog/2009/speed-improvements-for-ask-sympyqueriesask/<p>I managed to overcome the overhead in ask() that arises when converting
between the symbol and integer representations of sentences in conjunctive
normal form. The result went beyond what I expected. The test suite for the
query module got 10x faster on my laptop. From 26 seconds, it
descended to an …</p><p>I managed to overcome the overhead in ask() that arises when converting
between the symbol and integer representations of sentences in conjunctive
normal form. The result went beyond what I expected. The test suite for the
query module got 10x faster on my laptop. From 26 seconds, it
descended to an impressive 2.03 secs. There is still room for
improvement, but it is no longer "so desperately slow". I'll submit
those patches soon to sympy's trunk, but in the meantime they are in my
logic branch: <tt class="docutils literal">git pull <span class="pre">http://fa.bianp.net/git/sympy.git</span> logic</tt></p>
Logic module (sympy.logic): improving speed2009-08-18T23:35:00+02:002009-08-18T23:35:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-08-18:/blog/2009/logic-module-sympylogic-improving-speed/<p>Today I've been doing some speed improvements for the logic module. More
precisely, I implemented an efficient internal representation for
clauses in conjunctive normal form. In practice this means a huge
performance boost for all problems that make use of the function
satisfiable() or dpll_satisfiable(). For example, test_dimacs.py has
moved …</p><p>Today I've been doing some speed improvements for the logic module. More
precisely, I implemented an efficient internal representation for
clauses in conjunctive normal form. In practice this means a huge
performance boost for all problems that make use of the function
satisfiable() or dpll_satisfiable(). For example, test_dimacs.py has
moved from 2.7 seconds to an impressive 0.3 sec, and ask() runs on
average 3x faster, although both still have an overhead
because of the conversion to this new representation, which can be avoided
in most cases. Now, the details. Traditionally, dpll (the algorithm that
we use for deciding satisfiability) used to store clauses as arrays of
symbols, and this worked fine, but sadly comparing symbols in sympy is
slow, and this algorithm does a lot of comparisons ... but we can map
each sympy symbol to a unique integer, and with minor modifications to
the algorithm we get these performance gains. Now, the code. You can
pull from my branch logic:
<tt class="docutils literal">git pull <span class="pre">http://fa.bianp.net/git/sympy.git</span> logic</tt> There are now some
obvious performance tweaks we can do: - in ask(), we can skip the
conversion to integer representation by 'precompiling'
known_facts_dict into this representation. This should be easy and
will probably give performance boosts of several orders of magnitude. -
this integer representation is very similar to the one used in dimacs
CNF files, so a parser that directly converts CNF files to this integer
representation should make solving CNF files much faster. --- I would
like to give some credit to Ronan Lamy, who sent a patch some time ago,
and although I did not include it (yet) into main sympy branch, it
inspired me for these modifications.</p>
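<p>The symbol-to-integer mapping described in this post can be sketched in a few lines. This is a simplified illustration of the idea (DIMACS-style signed integers), not sympy's actual implementation:</p>

```python
def encode_cnf(clauses):
    """Encode CNF clauses given as lists of (symbol_name, negated) literals.

    Each symbol gets a unique positive integer; a negated literal becomes
    the negative of that integer, as in DIMACS CNF files.  Comparing
    integers is much cheaper than comparing sympy symbols.
    """
    symbols = {}   # symbol name -> unique positive integer
    encoded = []
    for clause in clauses:
        enc = []
        for name, negated in clause:
            n = symbols.setdefault(name, len(symbols) + 1)
            enc.append(-n if negated else n)
        encoded.append(enc)
    return encoded, symbols

# (A | ~B) & (B | C)
encoded, table = encode_cnf([[('A', False), ('B', True)],
                             [('B', False), ('C', False)]])
print(encoded)  # [[1, -2], [2, 3]]
```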
Refine module2009-08-17T19:20:00+02:002009-08-17T19:20:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-08-17:/blog/2009/refine-module/<p><a class="reference external" href="http://git.sympy.org/?p=sympy.git;a=commit;h=dd679c2751ac0900c47302fd6187ae9eea60918f">This</a> commit introduced a new module in sympy: the refine module. The
purpose of this module is to simplify expressions when they are bound to
assumptions. For example, if you know that x>0, then you can simplify
abs(x) to x. This code was traditionally embedded into the core …</p><p><a class="reference external" href="http://git.sympy.org/?p=sympy.git;a=commit;h=dd679c2751ac0900c47302fd6187ae9eea60918f">This</a> commit introduced a new module in sympy: the refine module. The
purpose of this module is to simplify expressions when they are bound to
assumptions. For example, if you know that x>0, then you can simplify
abs(x) to x. This code was traditionally embedded into the core, but now
this will be part of an external module (sympy.refine) upon which the
core has no dependencies. In a not very original move, I named the main
function in this module refine(). Its syntax is straightforward: the first
argument is an expression and the second argument is a set of assumptions.
Some examples (from isympy):
[cc lang="python"]
In [1]: refine(1 + abs(x), Assume(x, Q.positive))
Out[1]: 1 + x

In [2]: refine(exp(I*x*pi), Assume(x, Q.odd))
Out[2]: -1

In [3]: refine(exp(I*x*pi), Assume(x, Q.even))
Out[3]: 1
[/cc]
Right now the module lacks some rules, but the design (very
similar to the query module) will make adding these rules an easy task.</p>
Query module - finally in trunk2009-08-10T21:51:00+02:002009-08-10T21:51:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-08-10:/blog/2009/query-module-finally-in-trunk/<p>The query module is finally in the main SymPy repository. I made
substantial changes since last post, most of them at the user interface
level (thanks to Vinzent and Mateusz for many insightful comments). Main
function is ask(), which replaces the old expression.is_* syntax. You
can ask many things …</p><p>The query module is finally in the main SymPy repository. I made
substantial changes since last post, most of them at the user interface
level (thanks to Vinzent and Mateusz for many insightful comments). Main
function is ask(), which replaces the old expression.is_* syntax. You
can ask many things. For example, you can ask whether a given expression is
an integer, prime or real:
[cc lang="python"]
>>> ask(2, Q.integer)
True
>>> ask(x**2, Q.integer)
None
>>> ask(x**2, Q.integer, Assume(x, Q.integer))
True
>>> ask(sqrt(2)*x, Q.integer, Assume(x, Q.integer))
>>> ask(I*x, Q.real, Assume(x, Q.imaginary))
True
>>> ask(x*y, Q.prime, Assume(x, Q.prime) & Assume(y, Q.prime))
False
[/cc]
As you
see, it returns True when it is sure that the expression is an integer,
None if it does not know, and False if it is certainly not an integer.
The second argument, which we will call the 'key', specifies what we want to
ask about. For example, Q.integer asks whether the expression is an integer,
Q.negative whether it is negative, etc. For a complete list of all the
keys available, see doc/src/modules/queries.txt in the sympy codebase. It
also accepts an optional third argument where you can specify
assumptions that the symbols in expr satisfy. That can be any kind of
boolean expression involving assumptions, for example
<tt class="docutils literal">Assume(x, Q.positive) & Assume(x, Q.integer)</tt>,
<tt class="docutils literal">Assume(x, Q.positive) | Assume(x, Q.negative)</tt>, or even
<tt class="docutils literal">NAnd(Assume(x, Q.positive), Assume(x, Q.integer))</tt></p>
django, change language settings dynamically2009-08-07T16:12:00+02:002009-08-07T16:12:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-08-07:/blog/2009/django-change-language-settings-dynamically/<p>After some failed attempts, I just found how to change the language
settings dynamically in django, and I thought it could be useful to
someone. Just use function activate() from django.utils.translation. For
example: [cc lang="python"] from django.utils.translation import
activate activate('es-ES') [/cc] will change global …</p><p>After some failed attempts, I just found how to change the language
settings dynamically in django, and I thought it could be useful to
someone. Just use function activate() from django.utils.translation. For
example: [cc lang="python"] from django.utils.translation import
activate activate('es-ES') [/cc] will change global language settings to
'es-ES' (Spain spanish). I use it because I have two forms and I want
that errors appear in different languages, but I do not want to get into
gettext</p>
can we merge now, pleeease ?2009-07-21T22:36:00+02:002009-07-21T22:36:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-07-21:/blog/2009/can-we-merge-now-pleeease/<p>Three months after I began to write sympy.queries, I feel it's about
time to include it in sympy's trunk, so today I sent for review <a class="reference external" href="http://groups.google.com/group/sympy-patches/browse_thread/thread/76dcdfd0994a1c81">4
patches that implement the complete query module</a>. It's been a lot of
fun, but it has also caused me some headaches ... especially last …</p><p>Three months after I began to write sympy.queries, I feel it's about
time to include it in sympy's trunk, so today I sent for review <a class="reference external" href="http://groups.google.com/group/sympy-patches/browse_thread/thread/76dcdfd0994a1c81">4
patches that implement the complete query module</a>. It's been a lot of
fun, but it has also caused me some headaches ... especially last month
trying to keep the code as bug-free as possible. The next step is to
improve the performance of query() and write the refine module. PS: This
weekend I'll be at Leipzig for EuroSciPy, hope to meet there some SymPy
developers!</p>
Refine module, proof of concept2009-07-09T02:49:00+02:002009-07-09T02:49:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-07-09:/blog/2009/refine-module-proof-of-concept/<p>The 0.6.5 release of SymPy is taking longer than expected because of <a class="reference external" href="http://code.google.com/p/sympy/issues/detail?id=1521">some
bugs in the testing framework</a>, so my query module is not merged into
trunk (yet). In the meantime, I am implementing a refine module (very
little code is available yet). The refine module implements a refine …</p><p>The 0.6.5 release of SymPy is taking longer than expected because of <a class="reference external" href="http://code.google.com/p/sympy/issues/detail?id=1521">some
bugs in the testing framework</a>, so my query module is not merged into
trunk (yet). In the meantime, I am implementing a refine module (very
little code is available yet). The refine module implements a refine()
function (better names accepted) that would work in a very similar way
as Mathematica's Refine
(<a class="reference external" href="http://documents.wolfram.com/mathematica/functions/Refine">http://documents.wolfram.com/mathematica/functions/Refine</a>). It tries to
simplify an expression based on it's assumptions. For example: [cc
lang="python"] >>> refine(abs(x), Assume(x, positive=True)) x >>>
refine(abs(x), Assume(x, negative=True)) -x [/cc] Initial code is in my
git repo, branch queries.</p>
Preparing a new release2009-06-30T08:57:00+02:002009-06-30T08:57:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-06-30:/blog/2009/preparing-a-new-release/<p>These last few days I've been busy preparing <a class="reference external" href="http://groups.google.com/group/sympy/browse_thread/thread/88474cde3bc6e350#">the first public beta of SymPy
0.6.5</a>. Most of the time was spent solving a bug that made
documentation tests fail under python2.4, but now that this is solved, I
hope that by the end of the week we can have …</p><p>These last few days I've been busy preparing <a class="reference external" href="http://groups.google.com/group/sympy/browse_thread/thread/88474cde3bc6e350#">the first public beta of SymPy
0.6.5</a>. Most of the time was spent solving a bug that made
documentation tests fail under python2.4, but now that this is solved, I
hope that by the end of the week we can have a final release. When
this release is published, we'll merge my query module and work on
getting it right for 0.7.</p>
Efficient DPLL algorithm2009-06-28T18:16:00+02:002009-06-28T18:16:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-06-28:/blog/2009/efficient-dpll-algorithm/<p>Background: DPLL is the algorithm behind SymPy's implementation of
logic.inference.satisfiable. After reading the original papers by Davis &
Putnam [1], I managed to implement a more efficient version of the DPLL
algorithm. It is 10x faster on medium-sized problems (40
variables), and fixes some wrong-result bugs [2 …</p><p>Background: DPLL is the algorithm behind SymPy's implementation of
logic.inference.satisfiable. After reading the original papers by Davis &
Putnam [1], I managed to implement a more efficient version of the DPLL
algorithm. It is 10x faster on medium-sized problems (40
variables), and fixes some wrong-result bugs [2]. As a side effect, the
query module has become 2x faster. Source code lives in my sympy repo,
<a class="reference external" href="http://fa.bianp.net/git/sympy.git">http://fa.bianp.net/git/sympy.git</a>, branch logic. References: [1]
<a class="reference external" href="http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=321034">http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=321034</a> [2]
<a class="reference external" href="http://people.sc.fsu.edu/~burkardt/data/cnf/dubois22.cnf">http://people.sc.fsu.edu/~burkardt/data/cnf/dubois22.cnf</a></p>
Queries and performance2009-06-23T00:02:00+02:002009-06-23T00:02:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-06-23:/blog/2009/queries-and-performance/<p>After some hacking on the queries module, I finally got it right without
the <a class="reference external" href="http://fa.bianp.net/blog/?p=149">limitations of past versions</a>. You can check it out from my repo
<a class="reference external" href="http://fa.bianp.net/git/sympy.git">http://fa.bianp.net/git/sympy.git</a>, branch master. It now relies even more
on logic.inference.satisfiable(), which is just an implementation of …</p><p>After some hacking on the queries module, I finally got it right without
the <a class="reference external" href="http://fa.bianp.net/blog/?p=149">limitations of past versions</a>. You can check it out from my repo
<a class="reference external" href="http://fa.bianp.net/git/sympy.git">http://fa.bianp.net/git/sympy.git</a>, branch master. It now relies even more
on logic.inference.satisfiable(), which is just an implementation of the
<a class="reference external" href="http://en.wikipedia.org/wiki/DPLL_algorithm">DPLL algorithm</a>. The bad news is that (my implementation of)
dpll_satisfiable() is SLOW, so inevitably queries are SLOW. But all is
not lost: the algorithm itself is fast, and other variants (MiniSAT)
perform 6600x faster than my implementation on medium-sized problems (60
variables, 170 clauses). So it looks like something smells bad on the
programming side ... However, I spent the day profiling the function <a class="reference external" href="http://fa.bianp.net/blog/wp-content/uploads/2009/06/profilepy.zip">(link to
source code used for profiling)</a>, without much success.</p>
Reading CNF files2009-06-20T16:27:00+02:002009-06-20T16:27:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-06-20:/blog/2009/reading-cnf-files/<p><p>The DIMACS CNF file format is used to define a Boolean expression,
written in conjunctive normal form, that may be used as an example of
the satisfiability problem. The new logic module (sympy.logic) can read
the content of a CNF file and transform it into a Boolean expression
suitable …</p></p><p><p>The DIMACS CNF file format is used to define a Boolean expression,
written in conjunctive normal form, that may be used as an example of
the satisfiability problem. The new logic module (sympy.logic) can read
the content of a CNF file and transform it into a Boolean expression
suitable for use in other methods. For example, let quinn.cnf be a file
with the following content:</p>
<pre class="literal-block">
c an example from Quinn's text, 16 variables and 18 clauses.
c Resolution: SATISFIABLE
c
p cnf 16 18
1 2 0
-2 -4 0
3 4 0
-4 -5 0
5 -6 0
6 -7 0
6 7 0
7 -16 0
8 -9 0
-8 -14 0
9 10 0
9 -10 0
-10 -11 0
10 12 0
11 12 0
13 14 0
14 -15 0
15 16 0
</pre>
<p>Then we can load the file and test for satisfiability:</p>
<pre class="literal-block">
In [1]: from sympy.logic.utilities.dimacs import load_file
In [2]: expr = load_file("quinn.cnf")
In [3]: from sympy.logic.inference import satisfiable
In [4]: satisfiable(expr)
Out[4]:
{cnf_1: True, cnf_11: ...
</pre>
<p>References: more on the DIMACS CNF file format.
BUGS: for large files like this one, it exits prematurely with an error in the DPLL algorithm.</p>
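<p>The format is simple enough that a minimal parser fits in a few lines; here is a sketch in plain Python (an illustration, not the sympy.logic.utilities.dimacs code): comment lines start with c, the p cnf header gives the variable and clause counts, and each clause is a run of signed integers terminated by 0.</p>

```python
def parse_dimacs(text):
    """Parse a DIMACS CNF string into a list of clauses.

    Each clause is a list of signed ints (positive = variable,
    negative = its negation). Comment ('c') and header ('p') lines
    are skipped; a 0 token terminates each clause.
    """
    clauses, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in ('c', 'p'):
            continue
        for token in line.split():
            lit = int(token)
            if lit == 0:              # 0 ends the current clause
                clauses.append(current)
                current = []
            else:
                current.append(lit)
    return clauses
```

<p>On a file like quinn.cnf above, this yields one list of literals per clause, ready to hand to a SAT procedure.</p>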
Logic module merged2009-06-19T12:23:00+02:002009-06-19T12:23:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-06-19:/blog/2009/logic-module-merged/<p>Yesterday I finally merged the logic module into sympy's official master
branch; it should be released together with SymPy 0.6.5. Next thing to
do: profile the code and write some docs before the release.</p>
The boolean satisfiability problem2009-06-15T06:00:00+02:002009-06-15T06:00:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-06-15:/blog/2009/the-boolean-satisfiability-problem/<p><p>The most annoying problem in my implementation of the query system is that
it will not solve implications if the implied facts are far away from each
other. For instance, if the graph of known facts is something like this</p>
<pre class="literal-block">
Integer ----> Rational --> Real --> Complex
^ ^
| |
| -------
| |
Prime Even
^
|
|
MersennePrime
</pre>
<p>Then it will not know …</p></p><p><p>The most annoying problem in my implementation of the query system is that
it will not solve implications if the implied facts are far away from each
other. For instance, if the graph of known facts is something like this</p>
<pre class="literal-block">
Integer ----> Rational --> Real --> Complex
^ ^
| |
| -------
| |
Prime Even
^
|
|
MersennePrime
</pre>
<p>Then it will not know how to handle the query: Is x complex, assuming it
is a Mersenne prime? This is because the vertices MersennePrime and
Complex are far away from each other, and the query function does not
load the complete graph of known facts, but rather a small subgraph
centered on the assumed facts ... This was done for efficiency
reasons, because in the initial implementation I feared that the graph
of known facts could become huge, making it infeasible to search.
But things have changed now. The graph of known facts is not huge at all,
with roughly 20 vertices, so it is feasible to build the complete
graph the first time query() is called and store it for future uses.
And, most importantly, we have implemented fast algorithms for the
<a class="reference external" href="http://en.wikipedia.org/wiki/Boolean_satisfiability_problem">problem of boolean satisfiability</a> (<a class="reference external" href="http://en.wikipedia.org/wiki/DPLL_algorithm">DPLL</a> under
sympy.logic.algorithms.dpll), so all is ready to implement these ideas
in the following days. Interestingly, there seem to be many open-source
libraries for solving this problem. One that caught my attention early
is <a class="reference external" href="http://minisat.se/MiniSat.html">MiniSAT</a>, a nice little program written in C++ which is really fast.</p>
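<p>The idea above, answering a query by checking that the known facts together with the assumptions cannot all hold while the conclusion is false, can be illustrated with a toy entailment check over the fact graph (a brute-force sketch with made-up names, not the sympy.logic code):</p>

```python
from itertools import product

def entails(implications, assumption, conclusion):
    """Check whether `assumption` plus a list of implication rules
    (pairs (a, b) meaning a => b) forces `conclusion`.

    Brute-force model enumeration: the entailment holds iff no truth
    assignment satisfies every rule and the assumption while the
    conclusion is false.
    """
    symbols = sorted({s for a, b in implications for s in (a, b)}
                     | {assumption, conclusion})
    for values in product([False, True], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if not model[assumption] or model[conclusion]:
            continue  # not a counterexample candidate
        if all(not model[a] or model[b] for a, b in implications):
            return False  # found a countermodel: no entailment
    return True

# The fact graph from the post, one edge per implication.
facts = [("MersennePrime", "Prime"), ("Prime", "Integer"),
         ("Even", "Integer"), ("Integer", "Rational"),
         ("Rational", "Real"), ("Real", "Complex")]
```

<p>With this encoding the Mersenne prime query goes through: entails(facts, "MersennePrime", "Complex") is True, while entails(facts, "Even", "Prime") is False. A real implementation would of course use DPLL instead of enumerating all models.</p>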
</p>Initial implementation of the query system2009-06-12T06:36:00+02:002009-06-12T06:36:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-06-12:/blog/2009/initial-implementation-of-the-query-system/<p>I sent <a class="reference external" href="http://groups.google.com/group/sympy-patches/browse_thread/thread/e56ceda0038b7c23">some patches</a> to sympy-patches with an initial implementation
of the query system. You can check it out by pulling from my branch:
<tt class="docutils literal">git pull <span class="pre">http://fa.bianp.net/git/sympy.git</span> master</tt> into your sympy
repo. Some examples of what you can do (sample isympy session):
<tt class="docutils literal">In [1 …</tt></p><p>I sent <a class="reference external" href="http://groups.google.com/group/sympy-patches/browse_thread/thread/e56ceda0038b7c23">some patches</a> to sympy-patches with an initial implementation
of the query system. You can check it out by pulling from my branch:
<tt class="docutils literal">git pull <span class="pre">http://fa.bianp.net/git/sympy.git</span> master</tt> into your sympy
repo. Some examples of what you can do (sample isympy session):</p>
<pre class="literal-block">
In [1]: query(x, positive=True)
</pre>
<p>Returns None, as we do not know whether x is positive or not.</p>
<pre class="literal-block">
In [2]: query(abs(x), positive=True)
Out[2]: True
</pre>
<p>because abs() is always positive. Because exp() is always positive, the
following should also be True:</p>
<pre class="literal-block">
In [3]: query(exp(x), positive=True)
</pre>
<p>but why does it return None then? Well, it is simply not true that exp()
is always positive: it is always positive for real values, but SymPy does
not assume that x is real, so you have to specify that. This is now
done with the keyword assumptions:</p>
<pre class="literal-block">
In [5]: query(exp(x), positive=True, assumptions=Assume(x, real=True))
Out[5]: True
</pre>
<p>As you can see, assumptions are now independent objects and are not tied
to symbols any more. For more examples, see the file
sympy/query/tests/test_query.py. Still on the TODO list: support for
global assumptions, and solving more complex implications, like
<tt class="docutils literal">query(x, positive=True, assumptions=Assume(x, even=True))</tt>, where it
should build the chain of implications even => integer => rational =>
real. This chain of implications currently stops at rational for
efficiency reasons, because the number of facts grows at each step, which
makes the number of possible paths grow exponentially.</p>
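<p>Building that chain of implications is essentially reachability in the implication graph; a short worklist traversal visits each fact at most once. A sketch with hypothetical names (not SymPy code):</p>

```python
def implied_facts(graph, fact):
    """Return every fact reachable from `fact` by following
    implication edges, via a simple worklist traversal.

    graph maps a fact to the list of facts it directly implies.
    """
    seen, frontier = {fact}, [fact]
    while frontier:
        current = frontier.pop()
        for nxt in graph.get(current, ()):
            if nxt not in seen:     # visit each fact at most once
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# The chain from the post: even => integer => rational => real => complex.
chain = {"even": ["integer"], "integer": ["rational"],
         "rational": ["real"], "real": ["complex"]}
```

<p>Here implied_facts(chain, "even") yields the full set {"even", "integer", "rational", "real", "complex"}, so the number of visited facts grows only linearly with the graph, even though the number of distinct paths can grow much faster.</p>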
Assumption system and automatic theorem proving. Should I be learning LISP ?2009-06-03T13:50:00+02:002009-06-03T13:50:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-06-03:/blog/2009/assumption-system-and-automatic-theorem-proving-should-i-be-learning-lisp/<p>This is the third time I attempt to write the assumption system. Other
attempts could be described as me following the rule: “For any complex
problem, there is always a solution that is simple, clear, and wrong.”
My <a class="reference external" href="http://groups.google.com/group/sympy-patches/browse_thread/thread/b6fd5402e729f58/8006779044c41a17?lnk=gst&q=fabian+assumptions#8006779044c41a17">first attempt</a> (although better than the current assumption system)
did use very …</p><p>This is the third time I attempt to write the assumption system. Other
attempts could be described as me following the rule: “For any complex
problem, there is always a solution that is simple, clear, and wrong.”
My <a class="reference external" href="http://groups.google.com/group/sympy-patches/browse_thread/thread/b6fd5402e729f58/8006779044c41a17?lnk=gst&q=fabian+assumptions#8006779044c41a17">first attempt</a> (although better than the current assumption system)
did use very rudimentary logic and was not very smart. It could infer
some basic rules, like Integer => Rational, but could not construct long
paths like Prime => Integer => Rational => Real => Complex. You would
have to specify by hand that prime numbers are also complex ... and this
is just what makes the old assumption system unmanageable. I know that
state-of-the-art CAS like Mathematica use advanced resolution techniques
to get <a class="reference external" href="http://reference.wolfram.com/mathematica/tutorial/UsingAssumptions.html">smart results</a>. And in the open source world, well, the only
CAS that I could find that has this sort of algorithm is Maxima, but in
order to understand it I would have to learn LISP, and ironically
enough I started to contribute to SymPy because I didn't feel like
learning LISP ...</p>
Homenaje a Antonio Vega en La Percha2009-06-01T15:34:00+02:002009-06-01T15:34:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-06-01:/blog/2009/homenaje-a-antonio-vega-en-la-percha/<p>Last Thursday we were at La Percha playing some songs by Antonio Vega.
The video was put together by <a class="reference external" href="http://retrovisor.net">my father</a>, mixing the live sound
with a recording we made at Migue's place.</p>
<p><a class="reference external" href="http://vimeo.com/4926476">LOS ESCLAVOS: homenaje a Antonio Vega</a> from <a class="reference external" href="http://vimeo.com/user938253">Felipe Pedregosa</a> on
<a class="reference external" href="http://vimeo.com">Vimeo</a>.</p>
</p>Fun with the new Logic module2009-05-31T23:16:00+02:002009-05-31T23:16:00+02:00Fabian Pedregosatag:fa.bianp.net,2009-05-31:/blog/2009/fun-with-the-new-logic-module/<p><p>The logic module is slowly becoming useful. This week I managed to get
some basic inference in propositional logic working. This should be
enough for the assumption system (although having first-order inference
would be cool). You can pull from my branch:
`` git pull <a class="reference external" href="http://fa.bianp.net/git/sympy.git">http://fa.bianp.net/git/sympy.git …</a></p></p><p><p>The logic module is slowly becoming useful. This week I managed to get
some basic inference in propositional logic working. This should be
enough for the assumption system (although having first-order inference
would be cool). You can pull from my branch:
`` git pull <a class="reference external" href="http://fa.bianp.net/git/sympy.git">http://fa.bianp.net/git/sympy.git</a> logic`` Here are some
examples of what it can do: First, importing and defining our symbols</p>
<pre class="literal-block">
In [1]: A, B, C = symbols('ABC')
In [2]: from sympy.logic import *
</pre>
<p>It works with Symbols just as you would expect</p>
<pre class="literal-block">
In [3]: And(A, B)
Out[3]: And(A, B)
</pre>
<p>It applies De Morgan's rules automatically</p>
<pre class="literal-block">
In [4]: Not(Or(A, B))
Out[4]: And(Not(A), Not(B))
</pre>
<p>It converts to conjunctive normal form (CNF)</p>
<pre class="literal-block">
In [5]: to_cnf(Implies(A, And(B, C)))
Out[5]: And(Or(B, Not(A)), Or(C, Not(A)))
</pre>
<p>Some basic inference:</p>
<pre class="literal-block">
In [6]: pl_true( Or(A, B), {A : True}) # what can we say about Or(A, B) if A is True ?
Out[6]: True
In [7]: pl_true ( And(A, B), {B: False}) # what is And(A, B) if B is False
Out[7]: False
</pre>
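<p>The three-valued behaviour of pl_true, returning True, False, or None when the partial model does not decide the sentence, can be sketched standalone over nested tuples (an illustration of the idea, not the sympy.logic API):</p>

```python
def pl_true(expr, model):
    """Evaluate a propositional sentence under a partial model.

    expr is either a variable name (str) or a tuple
    ('not', e), ('and', e1, e2, ...), ('or', e1, e2, ...).
    Returns True, False, or None when the model does not decide it.
    """
    if isinstance(expr, str):
        return model.get(expr)          # None if unassigned
    op, args = expr[0], expr[1:]
    values = [pl_true(a, model) for a in args]
    if op == 'not':
        v = values[0]
        return None if v is None else not v
    if op == 'and':
        if False in values:
            return False                # one false conjunct decides it
        return None if None in values else True
    if op == 'or':
        if True in values:
            return True                 # one true disjunct decides it
        return None if None in values else False
    raise ValueError("unknown operator: %r" % op)
```

<p>So pl_true(('or', 'A', 'B'), {'A': True}) is True, pl_true(('and', 'A', 'B'), {'B': False}) is False, and pl_true(('and', 'A', 'B'), {'A': True}) is None, matching the session above.</p>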
<p>To be discussed: I'm not sure if we should override &&, || on Symbol
so that we can do A && B instead of And(A, B). It would make the code
cleaner, but I also don't want to bloat Symbol any more. What do you
think? I'm very proud of this, in the sense that it is a nice, clean
module that will hopefully serve as the foundation of the new assumption
system.</p>
</p>