Advisors  | Francis Bach           | INRIA/ENS, Paris, France
          | Alexandre Gramfort     | Telecom Paristech, Paris, France
Reviewers | Dimitri Van de Ville   | Univ. Geneva / EPFL, Geneva, CH
          | Alain Rakotomamonjy    | University of Rouen, Rouen, France
Examiners | Ludovic Denoyer        | UPMC, Paris, France
          | Marcel A.J. van Gerven | Donders Institute/Radboud Univ., NL
          | Bertrand Thirion       | INRIA/CEA, Saclay, France
[Figure: Patients → Lesions → Post-mortem analysis → Finding]
Post-mortem analysis of “Tan” [Paul Broca, 1861]
Until the 20th century, most knowledge about the brain came from the study of lesions and post-mortem analysis.
Blood-oxygen-level dependent (BOLD) signal
Friederici, A. D. (2011) “The brain basis of language processing: from structure to function”. Physiological reviews
Nowadays, with the advent of non-invasive imaging techniques, notably functional MRI, it is possible to study brain function in healthy subjects with a much higher level of precision than was ever possible before. The dominant paradigm in this setting is task-based fMRI, in which subjects are presented with stimuli while their brain activity is recorded.
Blood-oxygen-level dependent (BOLD) signal
Answer increasingly complex cognitive questions.
E. Cauvet, “Traitement des structures syntaxiques dans le langage et dans la musique”, PhD thesis.
Today, one of the most widely used methods to understand the neural substrate of cognitive functions is task-based fMRI. In this setting, in order to confront a cognitive hypothesis, a subject is placed in the scanner and presented with stimuli. During this time, a sequence of brain images is acquired. The signal recorded in these images is known as the blood-oxygen-level-dependent (BOLD) signal, and represents the oxygen consumption at each unit of volume (voxel). These images follow a processing pipeline consisting of several steps before conclusions are drawn. The pipeline's steps are complex, but they also represent exciting opportunities to develop new and better data analysis methods.
Input: BOLD signal and an experimental paradigm.
Goal: output time-independent activation coefficients.
Challenge: delayed nature of the hemodynamic response function (HRF).
The first step that we will analyze is a feature extraction step. Here the input is the BOLD signal recorded by the fMRI scanner (which at this stage is still a sequence of scans across time) together with an experimental paradigm. The goal is to transform this signal into time-independent activation coefficients that represent the effect of each stimulus on the BOLD signal.
Contribution: decoding with ordinal values.
We will also focus on the application of supervised learning to fMRI datasets, in a setting known as decoding or reverse inference. In this context, the beta-maps computed in the feature extraction step are used to predict some information about the stimuli. The goal is to establish whether the prediction is accurate enough to claim that the brain area encodes some information about the stimuli.
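To fix ideas, here is a minimal, hypothetical sketch of this decoding setting; the array names `beta_maps` and `stim_labels` are illustrative, not from the original pipeline:

```python
# Hypothetical decoding sketch: beta-maps are the features, a stimulus
# property is the target, and accuracy is assessed by cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
beta_maps = rng.randn(60, 500)        # (n_trials, n_voxels), e.g. from the GLM
stim_labels = rng.randint(0, 2, 60)   # some property of the stimuli

# if accuracy is reliably above chance, the region carries stimulus information
scores = cross_val_score(LogisticRegression(), beta_maps, stim_labels, cv=5)
print(scores.mean())
```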
This talk is divided into 4 parts, corresponding roughly to different chapters of my manuscript.
The first part presents a contribution to the feature extraction step. I will detail a model, named R1-GLM, for the joint estimation of the HRF and activation coefficients.
In the second part I will move to the supervised learning step of the pipeline and present different methods that can be used to estimate a decoding model when the target has a natural ordering.
where $\mathbf{X}_i = \mathbf{h} \ast \text{s}_i(t)$ (the HRF convolved with the stimulus function).
The LTI assumption allows us to build a model in which the BOLD signal is expressed as a linear combination of different regressors multiplied by the activation coefficients.
Each column of the design matrix contains the HRF convolved with the stimulus, as sketched below.
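As an illustration, a minimal sketch of how such a regressor can be built and the coefficients estimated by least squares; the toy HRF, onsets and names are made up for the example:

```python
# Build one design-matrix column: convolve the HRF with the stimulus onsets,
# then estimate the activation coefficients by ordinary least squares.
import numpy as np

def make_regressor(onsets, hrf, n_scans):
    stim = np.zeros(n_scans)
    stim[onsets] = 1.0                        # s_i(t): 1 at stimulus onsets
    return np.convolve(stim, hrf)[:n_scans]   # X_i = h * s_i(t)

n_scans, hrf = 200, np.hanning(20)            # toy HRF, illustration only
all_onsets = [[10, 60, 110], [30, 80, 130]]   # one onset list per condition
X = np.column_stack([make_regressor(o, hrf, n_scans) for o in all_onsets])
y = X.dot([1.5, -0.5]) + 0.1 * np.random.randn(n_scans)  # synthetic BOLD
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # estimated activation coefficients
```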
The hemodynamic response function (HRF) is known to vary substantially across subjects, brain regions and age.
D. Handwerker et al., “Variation of BOLD hemodynamic responses across subjects and brain regions and their effects on statistical analyses.,” Neuroimage 2004.
S. Badillo et al., “Group-level impacts of within- and between-subject hemodynamic variability in fMRI,” Neuroimage 2013.
Two basis-constrained models of the HRF: FIR and 3HRF.
In the GLM just introduced, the HRF is assumed to be known. However, it is known that the HRF varies substantially across subjects, brain regions and age.
In this setting the HRF is assumed to be a linear combination of elements in a set of basis functions.
Estimated as the solution to the optimization problem
$ \argmin_{\mathbf{h}, ~\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X} \text{vec}(\mathbf{h}\boldsymbol{\beta}^T)\|^2$
subject to $\|\mathbf{h}\| = 1$ and $\langle \mathbf{h}, \mathbf{h}_{\text{ref}} \rangle > 0$.
$\implies$ solved locally using quasi-Newton methods.
The resulting model is smooth (albeit non-convex) and can be solved using a variety of methods. The easiest to implement is to alternate the optimization of $\mathbf{h}$ and $\boldsymbol{\beta}$, where each step reduces to a least-squares subproblem. We found, however, that a joint minimization over both vectors using gradient-based methods is much faster.
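A minimal sketch of this joint minimization, assuming a design matrix `X` whose columns are grouped by condition (basis-major, column-major vec ordering) and initial values `h0`, `beta0`; all names are illustrative, and the unit-norm constraint is handled a posteriori thanks to the scale invariance of the rank-one term:

```python
# Rank-one GLM sketch: minimize ||y - X vec(h beta^T)||^2 jointly over
# (h, beta) with a quasi-Newton method, then normalize h.
import numpy as np
from scipy.optimize import minimize

def r1_glm(X, y, n_basis, n_cond, h0, beta0):
    def f_and_grad(params):
        h, beta = params[:n_basis], params[n_basis:]
        resid = y - X.dot(np.outer(h, beta).ravel(order='F'))  # vec(h beta^T)
        M = X.T.dot(resid).reshape((n_basis, n_cond), order='F')
        grad = -np.concatenate([M.dot(beta), M.T.dot(h)])      # chain rule
        return 0.5 * resid.dot(resid), grad

    res = minimize(f_and_grad, np.concatenate([h0, beta0]),
                   jac=True, method='L-BFGS-B')                # quasi-Newton
    h, beta = res.x[:n_basis], res.x[n_basis:]
    scale = np.linalg.norm(h)   # enforce ||h|| = 1 a posteriori; the sign
    return h / scale, beta * scale  # constraint <h, h_ref> > 0 is analogous
```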
Why validate like this? The ground truth is unknown, and the HRF is often a nuisance parameter in neuroimaging studies. Since we do not have access to the ground truth, we look for alternative measures that can assess the quality of the estimation.
Cross-validation score in two different datasets
S. Tom et al., “The neural basis of loss aversion in decision-making under risk,” Science 2007.
K. N. Kay et al., “Identifying natural images from human brain activity.,” Nature 2008.
This talk is divided into 4 parts, corresponding roughly to different chapters of my manuscript.
The first part presents a contribution to the feature extraction step. I will detail a model, named R1-GLM, for the joint estimation of the HRF and activation coefficients.
In the second part I will move to the supervised learning step of the pipeline and present different methods that can be used to estimate a decoding model when the target has a natural ordering.
Activation coefficients
Target values
Target values can be of different kinds: binary, multiclass, continuous, or ordinal.
How can we incorporate the order information into a decoding model?
We now enter the second part of this presentation, entitled decoding with ordinal labels. This is motivated by the need to analyze different fMRI datasets in which the target variable consists of ordinal values, as arose within the work of different colleagues at NeuroSpin.
The first dataset that we would like to analyze is one in which the target variable is the complexity level of a phrase.
Two relevant loss functions:
2 loss functions $\implies$ 2 aspects of the ordinal problem.
Since we are going to compare models, we first need to choose an evaluation metric, that is, the function that we will use to compare the different models.
Generalization of logistic regression.
Parameters ($p$ = num. dimensions, $k$ = num. classes)
Prediction
$1 + \#\{i : \mathbf{w}^T \mathbf{x} > \theta_i \}$
Surrogate loss function:
$\sum_{j=1}^{y-1} \text{logistic}(\theta_j-\mathbf{w}^T \mathbf{x}) + \sum_{j=y}^{k-1} \text{logistic}(\mathbf{w}^T \mathbf{x} - \theta_j)$
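A minimal sketch of this surrogate in code, assuming labels in $\{1, \dots, k\}$ and writing $\text{logistic}(t) = \log(1 + e^t)$ as above; enforcing increasing thresholds is omitted for brevity:

```python
# All-thresholds ordinal logistic surrogate: penalize every threshold on the
# wrong side of the score w^T x. params = [w (p entries), theta (k-1 entries)].
import numpy as np
from scipy.optimize import minimize

def logistic(t):
    return np.logaddexp(0, t)           # log(1 + exp(t)), numerically stable

def at_loss(params, X, y, k):
    p = X.shape[1]
    w, theta = params[:p], params[p:]
    scores = X.dot(w)
    total = 0.0
    for s, yi in zip(scores, y):
        total += logistic(theta[:yi - 1] - s).sum()  # want s > theta_j, j < y
        total += logistic(s - theta[yi - 1:]).sum()  # want s < theta_j, j >= y
    return total / len(y)

# usage sketch: x0 = np.concatenate([np.zeros(p), np.arange(k - 1.)])
#               res = minimize(at_loss, x0, args=(X, y, k))
```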
Parameters ($p$ = num. dimensions, $k$ = num. classes)
Prediction: $\text{sign} (\mathbf{w}^T \mathbf{x}_i - \mathbf{w}^T\mathbf{x}_j) = \text{sign} (\mathbf{w}^T(\mathbf{x}_i - \mathbf{x}_j))$
Surrogate loss function
$\text{logistic} (\text{sign}(y_i - y_j)(\mathbf{w}^T (\mathbf{x}_i - \mathbf{x}_j)))$
$\implies$ can be implemented as a binary-class logistic regression model, with $\hat{y}_{ij} = \text{sign}(y_i - y_j), \hat{\mathbf{x}}_{ij} = \mathbf{x}_i - \mathbf{x}_j$
This loss was first proposed within the context of information retrieval.
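A sketch of this reduction, under the assumption that tied pairs are simply discarded; the helper name is illustrative:

```python
# Pairwise reduction: train a binary logistic regression on differences of
# feature vectors, labeled by the sign of the label difference.
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_transform(X, y):
    X_p, y_p = [], []
    for i, j in itertools.combinations(range(len(y)), 2):
        if y[i] != y[j]:                 # ties carry no order information
            X_p.append(X[i] - X[j])
            y_p.append(np.sign(y[i] - y[j]))
    return np.asarray(X_p), np.asarray(y_p)

# X_p, y_p = pairwise_transform(X, y)
# no intercept: it cancels in w^T (x_i - x_j)
# clf = LogisticRegression(fit_intercept=False).fit(X_p, y_p)
```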
1) Predict level of complexity of a phrase (6 levels)
E. Cauvet, Traitement des structures syntaxiques dans le langage et dans la musique, PhD thesis
2) Predict number of letters in a word.
3) Predict the real size of objects from nouns.
V. Borghesani, “A perceptual-to-conceptual gradient of word coding along the ventral path,” PRNI 2014.
Top performers:
Top performer: RankLogistic
Application: role of language ROIs in complexity of sentences.
F. Pedregosa et al., “Learning to rank from medical imaging data,” MLMI 2012.
This talk is divided into 4 parts, corresponding roughly to different chapters of my manuscript.
The first part presents a contribution to the feature extraction step. I will detail a model, named R1-GLM, for the joint estimation of the HRF and activation coefficients.
In the second part I will move to the supervised learning step of the pipeline and present different methods that can be used to estimate a decoding model when the target has a natural ordering.
Is there a reason that can explain this behaviour?
Mean Absolute Error
So, according to these studies, it seems that loss functions of the form b) perform better on the datasets they tried. Question: should we assume that, because it performed better on some datasets, b) is superior to a)?
The results that we will give in this section provide theoretical insight on this question.
Fisher consistency:
We will now define the supervised learning problem in a probabilistic setting. We will need this to give the definition of Fisher consistency (for surrogate loss functions).
Previous work
Fisher consistency is a condition on surrogate loss functions: minimizing the surrogate in expectation must yield a prediction that also minimizes the expected original loss.
Propose a formulation that parametrizes the ordinal regression surrogates seen so far:
$$\psi^{\ell}(y, \boldsymbol{\alpha}) = \sum_{i = 1}^{y-1} \Delta \ell(y, i)\phi(\alpha_i) - \sum_{i=y}^{k-1}\Delta \ell(y, i)\phi(-\alpha_i)$$
where $\Delta \ell(y, i)$ denotes the difference of the loss $\ell$ between adjacent labels and $\phi$ is a convex margin function such as the logistic loss.
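As a sanity check, and assuming the convention $\Delta \ell(y, i) = \ell(y, i) - \ell(y, i+1)$ (so that both terms act as penalties), taking $\ell$ to be the absolute error gives $\Delta \ell(y, i) = 1$ for $i < y$ and $-1$ for $i \geq y$, and the surrogate reduces to the all-thresholds loss seen earlier:

$$\psi^{|\cdot|}(y, \boldsymbol{\alpha}) = \sum_{i = 1}^{y-1} \phi(\alpha_i) + \sum_{i=y}^{k-1} \phi(-\alpha_i)$$

with $\alpha_i = \mathbf{w}^T \mathbf{x} - \theta_i$ and $\phi$ the logistic loss.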
Consistency
$$\argmin_{\mathbf{h}, ~\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X} \text{vec}(\mathbf{h}\boldsymbol{\beta}^T)\|^2$$
Unused structure:
More realistic noise model.
While the R1-GLM performs much better than the alternative methods, it is my opinion that there is still margin for improvement.
For example, the current optimization algorithm solves many optimization problems of the form above, and it does not use the information that the design matrix $\mathbf{X}$ is always the same (only $\mathbf{y}$ changes).
Also, the model is frustratingly similar (but not equivalent!) to that of reduced rank regression [Andersen et al.].
The model of the BOLD signal that is commonly used assumes a linear-time-invariant relationship between the BOLD signal and the neural response.
Source: [Poldrack et al. 2011]
As outlined in the introduction, the estimation of activation coefficients from BOLD signal is challenging due to the delayed nature of the hemodynamic response.
To make the identification problem well posed, it is common to rely on an assumption that relates the BOLD signal to neural activity and that has been empirically shown to hold under a wide range of circumstances. This is known as the linear-time-invariant (LTI) assumption and is based on the following three points.
R1-GLM with separate designs: R1-GLMS
Optimization problem of the form
$ \sum_{i=1}^{k} \|\mathbf{y} - \beta_i \mathbf{X}^0_{\text{BS}i} \mathbf{h} - r_i \mathbf{X}^1_{\text{BS}i} \mathbf{h}\|^2$
subject to $\|\mathbf{B} \mathbf{h}\|_{\infty} = 1$ and $\langle \mathbf{h}, \mathbf{h}_{\text{ref}} \rangle > 0$.
Source: [Mumford et al. 2012]
Recently, a variation on the classical GLM was proposed by Mumford et al. that has been shown to improve estimation in highly correlated designs. The technique is simple: for each stimulus, the design matrix is collapsed into two regressors, one for the stimulus of interest and one pooling all the remaining stimuli.
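A small sketch of this collapsing step, under the two-regressor reading above; `X_full` and the helper name are illustrative:

```python
# For condition i, build a two-column design: the regressor of interest and
# the sum of all remaining regressors.
import numpy as np

def separate_design(X_full, i):
    """X_full: (n_scans, k) design matrix with one column per condition."""
    others = np.delete(X_full, i, axis=1).sum(axis=1)
    return np.column_stack([X_full[:, i], others])
```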
Models: Least Absolute Error, Ordinal Logistic Regression, Multiclass Classification
Least Absolute Error
Model parameters: $\mathbf{w} \in \RR^p, b \in \RR$
Decision function: $f(\mathbf{x}_i) = b + \langle \mathbf{x}_i, \mathbf{w} \rangle$
Prediction: round $f(\mathbf{x}_i)$ to the closest label in $\{1, \dots, k\}$
Estimation: $\mathbf{w}^*, b^* \in \argmin_{\mathbf{w}, b} \frac{1}{n}\sum_{i=1}^n |y_i - f(\mathbf{x}_i)| + \lambda \|\mathbf{w}\|^2$
Regression model in which the prediction is given by rounding to the closest label.
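A minimal sketch of this estimator, assuming labels in $\{1, \dots, k\}$; the derivative-free solver is used only because the absolute loss is non-smooth, and all names are illustrative:

```python
# Least Absolute Error decoder: ridge-penalized L1-loss regression, with
# predictions rounded (and clipped) to the closest label in {1, ..., k}.
import numpy as np
from scipy.optimize import minimize

def fit_lae(X, y, lam=1.0):
    n, p = X.shape
    def objective(params):
        w, b = params[:p], params[p]
        return np.abs(y - (X.dot(w) + b)).mean() + lam * w.dot(w)
    res = minimize(objective, np.zeros(p + 1), method='Powell')
    return res.x[:p], res.x[p]

def predict_lae(X, w, b, k):
    return np.clip(np.round(X.dot(w) + b), 1, k).astype(int)
```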