Here are my most recent publications in reverse chronological order. A similar list can also be found in my Google Scholar profile.

Worst-case convergence rates for the function suboptimality after $t$ iterations, for a dataset with $n$ samples
**Stochastic Frank-Wolfe for Constrained Finite-Sum Minimization**, Geoffrey Négiar, Gideon Dresdner, Alicia Tsai, Laurent El Ghaoui, Francesco Locatello, Fabian Pedregosa, 2020. *Under review*. ArXiv.

**Abstract**:
We propose a novel Stochastic Frank-Wolfe (a.k.a. Conditional Gradient) algorithm with a fixed batch size tailored to the constrained optimization of a finite sum of smooth objectives. The design of our method hinges on a primal-dual interpretation of the Frank-Wolfe algorithm.
Recent work to design stochastic variants of the Frank-Wolfe algorithm falls into two categories: algorithms with increasing batch size, and algorithms with a given, constant, batch size. The former have faster convergence rates but are impractical; the latter are practical but slower. The proposed method combines the advantages of both: it converges for unit batch size, and has faster theoretical worst-case rates than previous unit batch size algorithms. Our experiments also show faster empirical convergence than previous unit batch size methods for several tasks.

**Universal Average-Case Optimality of Polyak Momentum**, Damien Scieur, Fabian Pedregosa, 2020. *Under review*. ArXiv.

**Abstract**:
We consider the average-case runtime analysis of algorithms for minimizing quadratic objectives. In this setting, and contrary to the more classical worst-case analysis, non-asymptotic convergence rates and optimal algorithms depend on the full spectrum of the Hessian through its expected spectral distribution. Under mild assumptions, we show that these optimal methods converge asymptotically towards Polyak momentum independently of the expected spectral density. This makes Polyak momentum universally (i.e., independent of the spectral distribution) asymptotically average-case optimal.

A worst-case analysis can lead to misleading results
where the worst-case running times is much worse than the observed running time. Above: convergence of individual random
square least squares in gray, while the average suboptimality (orange) is well below the worst-case (blue)
**Average-case Acceleration Through Spectral Density Estimation**, Fabian Pedregosa, Damien Scieur, 2020. *Under review*. ArXiv.

**Abstract**: We develop a framework for designing optimal quadratic optimization methods in terms of their average-case runtime. This yields a new class of methods that achieve acceleration through a model of the Hessian's expected spectral density. We develop explicit algorithms for the uniform, Marchenko-Pastur, and exponential distributions. These methods are momentum-based gradient algorithms whose hyper-parameters can be estimated without knowledge of the Hessian's smallest singular value, in contrast with classical accelerated methods like Nesterov acceleration and Polyak momentum. Empirical results on quadratic, logistic regression and neural networks show the proposed methods always match and in many cases significantly improve over classical accelerated methods.

The proposed sufficient decrease condition ensures that the quadratic approximation is an upper bound at its constrained minimum of the line-search objective.
**Linearly Convergent Frank-Wolfe with Backtracking Line-Search**, Fabian Pedregosa, Geoffrey Negiar, Armin Askari, Martin Jaggi, 2020. *Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics*. ArXiv, Slides.

**Abstract**: Structured constraints in Machine Learning have recently brought the Frank-Wolfe (FW) family of algorithms back in the spotlight. While the classical FW algorithm has poor local convergence properties, the Away-steps and Pairwise FW variants have emerged as improved variants with faster convergence. However, these improved variants suffer from two practical limitations: they require at each iteration to solve a 1-dimensional minimization problem to set the step-size and also require the Frank-Wolfe linear subproblems to be solved exactly. In this paper, we propose variants of Away-steps and Pairwise FW that lift both restrictions simultaneously. The proposed methods set the step-size based on a sufficient decrease condition, and do not require prior knowledge of the objective. Furthermore, they inherit all the favorable convergence properties of the exact line-search version, including linear convergence for strongly convex functions over polytopes. Benchmarks on different machine learning problems illustrate large performance gains of the proposed variants.

**Information matrices and generalization**, Valentin Thomas, Fabian Pedregosa, Bart van Merriënboer, Pierre-Antoine Mangazol, Yoshua Bengio, Nicolas Le Roux, 2020. *Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics*

**Abstract**: This work revisits the use of information criteria to characterize the generalization of deep learning models. In particular, we empirically demonstrate the effectiveness of the Takeuchi information criterion (TIC), an extension of the Akaike information criterion (AIC) for misspecified models, in estimating the generalization gap, shedding light on why quantities such as the number of parameters cannot quantify generalization. The TIC depends on both the Hessian of the loss H and the covariance of the gradients C. By exploring the similarities and differences between these two matrices as well as the Fisher information matrix F, we study the interplay between noise and curvature in deep models. We also address the question of whether C is a reasonable approximation to F, as is commonly assumed.

Major milestones from SciPy’s initial release in 2001 to the release of SciPy 1.0 in 2017.
**SciPy 1.0--Fundamental Algorithms for Scientific Computing in Python**, Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E.A. Quintero, Charles R Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, SciPy 1.0 Contributors ArXiv.

**Abstract**: SciPy is an open source scientific computing library for the Python programming language. SciPy 1.0 was released in late 2017, about 16 years after the original version 0.1 release. SciPy has become a de facto standard for leveraging scientific algorithms in the Python programming language, with more than 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories, and millions of downloads per year. This includes usage of SciPy in almost half of all machine learning projects on GitHub, and usage by high profile projects including LIGO gravitational wave analysis and creation of the first-ever image of a black hole (M87). The library includes functionality spanning clustering, Fourier transforms, integration, interpolation, file I/O, linear algebra, image processing, orthogonal distance regression, minimization algorithms, signal processing, sparse matrix handling, computational geometry, and statistics. In this work, we provide an overview of the capabilities and development practices of the SciPy library and highlight some recent technical developments.

We find a monotonically decreasing path between different initializations and a set of "good" solutions found through pruning.
**The Difficulty of Training Sparse Neural Networks**, Utku Evci, Fabian Pedregosa, Aidan Gomez, Erich Elsen, 2019. ArXiv

**Abstract**: This work revisits the use of information criteria to characterize the generalization of deep learning models. In particular, we empirically demonstrate the effectiveness of the Takeuchi information criterion (TIC), an extension of the Akaike information criterion (AIC) for misspecified models, in estimating the generalization gap, shedding light on why quantities such as the number of parameters cannot quantify generalization. The TIC depends on both the Hessian of the loss H and the covariance of the gradients C. By exploring the similarities and differences between these two matrices as well as the Fisher information matrix F, we study the interplay between noise and curvature in deep models. We also address the question of whether C is a reasonable approximation to F, as is commonly assumed.

The proposed method combines the advantages of variance-reduced methods (incremental updates, non-decreasing step sizes and sparse updates) with those of proximal splitting methods (support for a sum of non-smooth terms).
**Proximal Splitting Meets Variance Reduction**, Fabian Pedregosa, Kilian Fatras, Mattia Casotto, 2018. *Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics*. Proceedings version.

**Abstract**: Despite the rise to fame of incremental variance-reduced methods in recent years, their use in nonsmooth optimization is still limited to few simple cases. This is due to the fact that existing methods require to evaluate the proximity operator for the nonsmooth terms, which can be a costly operation for complex penalties. In this work we introduce two variance-reduced incremental methods based on SAGA and SVRG that can efficiently take into account complex penalties which can be expressed as a sum of proximal terms. This includes penalties such as total variation, group lasso with overlap and trend filtering, to name a few. Furthermore, we also develop sparse variants of the proposed algorithms which can take advantage of sparsity in the input data. Like other incremental methods, it only requires to evaluate the gradient of a single sample per iteration, and so is ideally suited for large scale applications. We provide a convergence rate analysis for the proposed methods and show that they converge with a fixed step-size, achieving in some cases the same asymptotic rate as their full gradient variants. Empirical benchmarks on 3 different datasets illustrate the practical advantages of the proposed methods.

In the proposed method, the candidate step size must verify a sufficient decrease condition, which can be interpreted as a quadratic upper bound on the objective function.
**Adaptive Three Operator Splitting**, Fabian Pedregosa, Gauthier Gidel, *Proceedings of the 35th International Conference on Machine Learning*, 2018. ICML, ArXiv, Slides, Poster, Video (starts at minute 52).

**Abstract**: We propose and analyze a novel adaptive step size variant of the Davis-Yin three operator splitting, a method that can solve optimization problems composed of a sum of a smooth term for which we have access to its gradient and an arbitrary number of potentially non-smooth terms for which we have access to their proximal operator. The proposed method leverages local information of the objective function, allowing for larger step sizes while preserving the convergence properties of the original method. It only requires two extra function evaluations per iteration and does not depend on any step size hyperparameter besides an initial estimate. We provide a convergence rate analysis of this method, showing sublinear convergence rate for general convex functions and linear convergence under stronger assumptions, matching the best known rates of its non adaptive variant. Finally, an empirical comparison with related methods on 6 different problems illustrates the computational advantage of the adaptive step size strategy.

We propose a step-size for the Frank-Wolfe and Matching Pursuit algorithms that only depends on local information of the objective. This allows to take larger step-sizes and converge faster.

**Frank-Wolfe Splitting via Augmented Lagrangian Method**, Gauthier Gidel, Fabian Pedregosa, Simon Lacoste-Julien, *Proceedings of the Twenty-First International Conference on Artficial Intelligence and Statistics*, 2018. PDF, PDF supplementary, Poster, Slides

**Abstract**: Minimizing a function over an intersection of convex sets is an important task in optimization that is often much more challenging than minimizing it over each individual constraint set. While traditional methods such as Frank-Wolfe (FW) or proximal gradient descent assume access to a linear or quadratic oracle on the intersection, splitting techniques take advantage of the structure of each sets, and only require access to the oracle on the individual constraints. In this work, we develop and analyze the Frank-Wolfe Augmented Lagrangian (FW-AL) algorithm, a method for minimizing a smooth function over convex compact sets related by a “linear consistency” constraint that only requires access to a linear minimization oracle over the individual constraints. It is based on the Augmented Lagrangian Method (ALM), also known as Method of Multipliers, but unlike most existing splitting methods, it only requires access to linear (instead of quadratic) minimization oracles. We use recent advances in the analysis of Frank-Wolfe and the alternating direction method of multipliers algorithms to prove a sublinear convergence rate for FW-AL over general convex compact sets and a linear convergence rate for polytopes.

The Randomized Away-steps FW variant that we propose (RAFW) is competitive with Away-steps FW (AFW), albeit with a much smaller per iteration cost.
**Frank-Wolfe with Subsampling Oracle**, Thomas Kerdreux, Fabian Pedregosa, Alexandre d'Aspremont, *Proceedings of the 35th International Conference on Machine Learning*, 2018. ICML, ArXiv

**Abstract**: We analyze two novel randomized variants of the Frank-Wolfe (FW) or conditional gradient algorithm. While classical FW algorithms require solving a linear minimization problem over the domain at each iteration, the proposed method only requires to solve a linear minimization problem over a small subset of the original domain. The first algorithm that we propose is a randomized variant of the original FW algorithm and achieves a (1/t) sublinear convergence rate as in the deterministic counterpart. The second algorithm is a randomized variant of the Away-step FW algorithm, and again as its deterministic counterpart, reaches linear (i.e., exponential) convergence rate making it the first provably convergent randomized variant of Away-step FW. In both cases, while subsampling reduces the convergence rate by a constant factor, the linear minimization step can be a fraction of the cost of that of the deterministic versions, especially when the data is streamed. We illustrate computational gains of the algorithms on regression problems, involving both L1 and latent group lasso penalties .

In this paper we highlight an issue present in large part of the literature of asynchronous stochastic optimization: under the traditional labeling scheme (i.e., how the iterates are named), gradient estimates might be biased. We propose a solution to this issue via a different labeling scheme.
**Improved asynchronous parallel optimization analysis for stochastic incremental methods**, Rémi Leblond, Fabian Pedregosa, Simon Lacoste-Julien, 2018. Journal of Machine Learning Research

**Abstract**: As datasets continue to increase in size and multi-core computer architectures are developed, asynchronous parallel optimization algorithms become more and more essential to the field of Machine Learning. Through a novel perspective, we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced "perturbed iterate" framework that resolves it. We demonstrate the usefulness of our new framework by analyzing three distinct asynchronous parallel incremental optimization algorithms: Hogwild (asynchronous SGD), KROMAGNON (asynchronous SVRG) and ASAGA, a novel asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates.

ProxASAGA achieves significant speedups over its sequential
version while being orders of magnitude faster than competing methods.
**Breaking the Nonsmooth Barrier: A Scalable
Parallel Method for Composite Optimization**, Fabian Pedregosa, Rémi Leblond, Simon Lacoste-Julien. *Advances in Neural Information Processing Systems 30 (NIPS)*, 2017. Spotlight presentation (top 3% of submitted papers). NIPS, ArXiv, Github, BibTex, Video, Poster.

**Abstract**: Due to their simplicity and excellent performance, parallel asynchronous variants of stochastic gradient descent have become popular methods to solve a wide range of large-scale optimization problems on multi-core architectures. Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso or empirical risk minimization with convex constraints. In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse method inspired by SAGA, a variance reduced incremental gradient algorithm. The proposed method is easy to implement and significantly outperforms the state of the art on several nonsmooth, large-scale problems. We prove that our method achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term. Empirical benchmarks on a multi-core architecture illustrate practical speedups of up to 12x on a 20-core machine.

ASAGA is simple asynchronous algorithm that achieves state of the art performance and significant speedups on parallel processors. Above, a 5x speedup on 10 processors over its sequential version on two different datasets.
**ASAGA: Asynchronous Parallel SAGA**, Rémi Leblond, Fabian Pedregosa, Simon Lacoste-Julien. *Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS)*, 2017.
AISTATS ArXiv, Webpage, Github, Blog post.

**Abstract**: We describe ASAGA, an asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates. Through a novel perspective, we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced “perturbed iterate” framework that resolves it. We thereby prove that ASAGA can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We present results of an implementation on a 40-core architecture illustrating the practical speedup as well as the hardware overhead.

We characterize a the Fisher consistency of a rich family of loss functions commonly used in ordinal regression. Surprisingly, we discover that many such functions are consistent with respect to the 0-1 loss and not the absolute error.
**On the Consistency of Ordinal Regression Methods**, Fabian Pedregosa, Francis Bach, and Alexandre Gramfort, *Journal of Machine Learning Research*, 2017. JMLR, ArXiv, Blog post.

**Abstract:** Many of the ordinal regression models that have been proposed in the literature can be seen as methods that minimize a convex surrogate of the zero-one, absolute, or squared loss functions. A key property that allows to study the statistical implications of such approximations is that of *Fisher consistency*. Fisher consistency is a desirable property for surrogate loss functions and implies that in the population setting, i.e., if the probability distribution that generates the data were available, then optimization of the surrogate would yield the best possible model.
In this paper we will characterize the Fisher consistency of a rich family of surrogate loss functions used in the context of ordinal regression, including support vector ordinal regression, ORBoosting and least absolute deviation.

This paper describes SymPy, a Python library for symbolic computations. Among other things, it can compute the symbolic expression for derivatives, integrals or compute the exact solution of some polynomial equations (think of $\sqrt{2}$ vs $1.4142...$). Check out its website and comprehensive documentation for more information.
**SymPy: Symbolic computing in Python**, Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev, Matthew Rocklin, AmiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Štěpán Roučka, Ashutosh Saboo, Isuru Fernando, Sumith Kulal, Robert Cimrman, Anthony Scopatz. *PeerJ Computer Science*, 2017. PeerJ.

**Abstract**: SymPy is an open source computer algebra system written in pure Python. It is built with a focus on extensibility and ease of use, through both interactive and programmatic applications. These characteristics have led SymPy to become a popular symbolic library for the scientific Python ecosystem. This paper presents the architecture of SymPy, a description of its features, and a discussion of select submodules. The supplementary material provide additional examples and further outline details of the architecture and features of SymPy.

An approximate gradient can be used to optimize a cross-validation loss with respect to hyperparameters. A decreasing bound between the true gradient and the approximate ensures the method converges towards a stationary point.
**Hyperparameter optimization with approximate gradient**, Fabian Pedregosa,
*Proceedings of The 33rd International Conference on Machine Learning, 2016*. ICML, ArXiv, Slides, Code, Blog post.

**Abstract**: Most models in machine learning contain at least one hyperparameter to control for model complexity. Choosing an appropriate set of hyperparameters is both crucial in terms of model accuracy and computationally challenging. In this work we propose an algorithm for the optimization of continuous hyperparameters using inexact gradient information. An advantage of this method is that hyperparameters can be updated before model parameters have fully converged. We also give sufficient conditions for the global convergence of this method, based on regularity conditions of the involved functions and summability of errors. Finally, we validate the empirical performance of this method on the estimation of regularization constants of L2-regularized logistic regression and kernel Ridge regression. Empirical benchmarks indicate that our approach is highly competitive with respect to state of the art methods.

How is the meaning of words instantiated in the brain?. Our results indicate that different aspects of word meaning are encoded in a distributed way
across different brain areas. **Word meaning in the ventral visual path: a perceptial to conceptual gradient of semantic coding.** Valentina Borghesani, Fabian Pedregosa, Marco Buiatti, Alexis Amadon, Evelyn Eger, and Manuela Piazza. *NeuroImage*, 2016.
PDF, NeuroImage, ResearchGate.

**Abstract**: The meaning of words referring to concrete items is thought of as a multidimensional representation that includes both perceptual (e.g., average size, prototypical color) and conceptual (e.g., taxonomic class) dimensions. Are these different dimensions coded in different brain regions? In healthy human subjects, we tested the presence of a mapping between the implied real object size (a perceptual dimension) and the taxonomic categories at different levels of specificity (conceptual dimensions) of a series of words, and the patterns of brain activity recorded with functional magnetic resonance imaging in six areas along the ventral occipito-temporal cortical path. Combining multivariate pattern classification and representational similarity analysis, we found that the real object size implied by a word appears to be primarily encoded in early visual regions, while the taxonomic category and sub-categorical cluster in more anterior temporal regions. This anteroposterior gradient of information content indicates that different areas along the ventral stream encode complementary dimensions of the semantic space.

This is a short technical report that simplifies some of the convergence proofs for the Davis-Yin three operator splitting method.
**On the convergence rate of the three operator splitting scheme**, Fabian Pedregosa. ArXiv:1610.07830

**Abstract**: The three operator splitting scheme was recently proposed by [Davis and Yin, 2015] as a method to optimize composite objective functions with one convex smooth term and two convex (possibly non-smooth) terms for which we have access to their proximity operator. In this short note we provide an alternative proof for the sublinear rate of convergence of this method.

One of the key aspects investigated in this thesis is the estimation of brain maps from a blood-oxygen-level dependent (BOLD) signal. One of the difficulties in this process stems from the fact that the BOLD signal does not
increase instantaneously after the stimuli onset nor does it return to baseline
immediately after the stimulus oset. Instead, it peaks
approximately 5 seconds after stimulation, and is followed by an undershoot
that lasts as long as 30 seconds.
**PhD thesis: Feature extraction and supervised learning on fMRI: from practice to theory**, defended February 2015 [PDF] [slides]

**Advisors**: Alexandre Gramfort (Telecom Paristech, Paris, France) and Francis Bach (INRIA / ENS, Paris, France).

**Reviewers**: Dimitri Van de Ville (Univ. Geneva / EPFL, Geneva, CH) and Alain Rakotomamonjy (University of Rouen, Rouen, France).

**Jury**: Marcel Van Gerven (Donders Instute, Nijmegen, NL), Ludovic Denoyer ( UPMC, Paris, France), Bertrand Thirion (INRIA / CEA, Saclay, France)

**Abstract**: Until the advent of non-invasive neuroimaging modalities the knowledge of the human brain came from the study of its lesions, post-mortem analyses and invasive experimentations. Nowadays, modern imaging techniques such as fMRI are revealing several aspects of the human brain with progressively high spatio-temporal resolution. However, in order to answer increasingly complex neuroscientific questions the technical improvements in acquisition must be matched with novel data analysis methods. In this thesis we examine different applications of machine learning to the processing of fMRI data. We propose novel extensions and investigate the theoretical properties of different models...

We describe a method to estimate the hemodynamical response function (HRF) from an fMRI signal. Our method allows to estimate a different HRF per voxel. The difference in the estimated HRFs suggests a substantial
variability at the voxel level within a single subject and a single task.
**Data-driven HRF estimation for encoding and decoding models**, Fabian Pedregosa, Michael Eickenberg, Philippe Ciuciu, Bertrand Thirion, and Alexandre Gramfort. *Neuroimage*, 2015. ArXiv, NeuroImage

**Abstract**: Despite the common usage of a canonical, data-independent, hemodynamic response function (HRF), it is known that the shape of the HRF varies across brain regions and subjects. This suggests that a data-driven estimation of this function could lead to more statistical power when modeling BOLD fMRI data. However, unconstrained estimation of the HRF can yield highly unstable results when the number of free parameters is large. We develop a method for the joint estimation of activation and HRF by means of a rank constraint, forcing the estimated HRF to be equal across events or experimental conditions, yet permitting it to differ across voxels. Model estimation leads to an optimization problem that we propose to solve with an efficient quasi-Newton method, exploiting fast gradient computations. This model, called GLM with Rank-1 constraint (R1-GLM), can be extended to the setting of GLM with separate designs which has been shown to improve decoding accuracy in brain activity decoding experiments. We compare 10 different HRF modeling methods in terms of encoding and decoding scores on two different datasets. Our results show that the R1-GLM model outperforms competing methods in both encoding and decoding settings, positioning it as an attractive method both from the points of view of accuracy and computational efficiency.

**Correlations of correlations are not reliable statistics: implications for multivariate pattern analysis**. Bertrand Thirion, Fabian Pedregosa, Michael Eickenberg, Gael Varoquaux, ICML Workshop on Statistics, Machine Learning and Neuroscience (Stamlins 2015) PDF

**A perceptual-to-conceptual gradient of word coding along the ventral path** Valentina Borghesani, Fabian Pedregosa, Evelyn Eger, Marco Buiatti,
Manuela Piazza. *4th International Workshop in Pattern Recognition and Neuroimaging*, 2014. [PDF]

**Machine learning for neuroimaging with scikit-learn**, Alexandre Abraham, Fabian Pedregosa, Michael Eickenberg, Philippe Gervais, Andreas Mueller, Jean Kossaifi, Alexandre Gramfort, Bertrand Thirion, Gaël Varoquaux. *Frontiers in Neuroinformatics*, 2014. [link]

**API design for machine learning software: experiences from the scikit-learn project**, Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake Vanderplas, Arnaud Joly, Brian Holt, Gaël Varoquaux. *European Conference on Machine Learning and Principles and Practices of Knowledge Discovery in Databases*, 2013. ArXiv

**HRF estimation improves sensitivity of fMRI encoding and decoding models**, Fabian Pedregosa, Michael Eickenberg, Bertrand Thirion, Alexandre Gramfort. *International Workshop on Pattern Recognition in Neuroimaging (PRNI)*, 2013. ArXiv

**Second order scattering descriptors predict fMRI activity due to visual textures**, Michael Eickenberg, Mehdi Senoussi, Fabian Pedregosa, Alexandre Gramfort, Bertrand Thirion. *International Workshop on Pattern Recognition in Neuroimaging (PRNI)*, 2013. ArXiv

**Learning to rank from medical imaging data**, Fabian Pedregosa et al. in Proc. 3rd Int. Work. Mach. Learn. Med. Imaging (2012). [ArXiv] ArXiv

**Scikit-learn : Machine Learning in Python**, Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay, *Journal of Machine Learning Research*, 2011. JMLR, PDF.

**Abstract**: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.org.

**Multi-subject dictionary learning to segment an atlas of brain spontaneous activity**, Gaël Varoquaux, Alexandre Gramfort, Fabian Pedregosa, Vincent Michel, Bertrand Thirion. [PDF]

**Abstract**: Fluctuations in brain on-going activity can be used to reveal
its intrinsic functional organization. To mine this information, we give
a new hierarchical probabilistic model for brain activity patterns that
does not require an experimental design to be specified. We estimate this
model in the dictionary learning framework, learning simultaneously latent
spatial maps and the corresponding brain activity time-series. Unlike
previous dictionary learning frameworks, we introduce an explicit
difference between subject-level spatial maps and their corresponding
population-level maps, forming an atlas. We give a novel algorithm using
convex optimization techniques to solve efficiently this problem with
non-smooth penalties well-suited to image denoising. We show on simulated
data that it can recover population-level maps as well as subject
specificities. On resting-state fMRI data, we extract the first atlas of
spontaneous brain activity and show how it defines a subject-specific
functional parcellation of the brain in localized regions.