
Feature extraction and supervised learning on fMRI: from practice to theory

Fabian Pedregosa-Izquierdo, Parietal/Sierra (INRIA)

Advisors: Francis Bach (INRIA/ENS, Paris, France), Alexandre Gramfort (Telecom Paristech, Paris, France).
Reviewers: Dimitri Van de Ville (Univ. Geneva/EPFL, Geneva, CH), Alain Rakotomamonjy (University of Rouen, Rouen, France).
Examiners: Ludovic Denoyer (UPMC, Paris, France), Marcel A.J. van Gerven (Donders Institute/Radboud Univ., NL), Bertrand Thirion (INRIA/CEA, Saclay, France).

Study of cognitive function

Before the advent of non-invasive imaging modalities.

(Diagram: Patients → Lesions → Post-mortem analysis → Finding.)

Post-mortem analysis of “Tan” [Paul Broca, 1861]

1 / 35

fMRI-based study

(Figure: fMRI scanner and the resulting fMRI scans.)

Blood-oxygen-level dependent (BOLD) signal

  • Processing pipeline

Friederici, A. D. (2011) “The brain basis of language processing: from structure to function”. Physiological reviews

2 / 35

fMRI-based study

(Figure: fMRI scanner and the resulting fMRI scans.)

Blood-oxygen-level dependent (BOLD) signal

  • Processing pipeline
  • Preprocessing
  • Feature Extraction
  • Univariate Statistics
  • Supervised Learning

Answer increasingly complex cognitive questions.

E. Cauvet, “Traitement des structures syntaxiques dans le langage et dans la musique” (Processing of syntactic structures in language and music), PhD thesis.

3 / 35

Feature extraction

Input: BOLD signal and an experimental paradigm.

Goal: output time-independent activation coefficients.

Challenge: the delayed nature of the hemodynamic response function (HRF).

4 / 35

Supervised Learning - Decoding

Predict stimuli from activation coefficients.

Contribution: decoding with ordinal values.

5 / 35

Outline

6 / 35

The General Linear Model (GLM)

The BOLD signal is modeled as $y = X\beta + \varepsilon$, where each design column $X_i = s_i(t) \ast h(t)$ is the stimulus sequence of condition $i$ convolved with the HRF.
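To make the model concrete, here is a minimal sketch (not from the thesis: the two-gamma HRF shape, the event onsets and the noise level are invented for illustration) that builds one design column by convolution and recovers the activation coefficient by least squares:

```python
import numpy as np
from scipy.stats import gamma

def toy_hrf(t):
    """Two-gamma HRF: an early peak minus a smaller, later undershoot."""
    return gamma.pdf(t, 6) - 0.35 * gamma.pdf(t, 16)

tr, n_scans = 2.0, 120
s = np.zeros(n_scans)          # stimulus sequence s_i(t), one condition
s[[5, 30, 55, 80]] = 1.0       # hypothetical event onsets (in scans)

h = toy_hrf(np.arange(0, 32, tr))             # HRF sampled at the TR
x_i = np.convolve(s, h)[:n_scans]             # design column X_i = s_i * h
X = np.column_stack([x_i, np.ones(n_scans)])  # plus an intercept regressor

rng = np.random.RandomState(0)
y = 2.0 * x_i + 0.5 * rng.randn(n_scans)      # simulated BOLD time series
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # beta[0] ~ 2, the activation
```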

7 / 35

Basis-constrained HRF

Hemodynamic response function (HRF) is known to vary substantially across subjects, brain regions and age.

D. Handwerker et al., “Variation of BOLD hemodynamic responses across subjects and brain regions and their effects on statistical analyses.,” Neuroimage 2004.

S. Badillo et al., “Group-level impacts of within- and between-subject hemodynamic variability in fMRI,” Neuroimage 2013.

Two basis-constrained models of the HRF: FIR and 3HRF.
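To make the FIR choice concrete, the sketch below builds an FIR design block: one column per lag, each a delayed copy of the onset vector, so the fitted coefficients trace out the HRF over the window (the 3HRF basis would instead use three columns: a canonical HRF plus its time and dispersion derivatives). Sizes and onsets are hypothetical:

```python
import numpy as np

def fir_design(onsets, n_scans, n_lags):
    """FIR design block: column j is the onset vector delayed by j scans,
    leaving the HRF unconstrained over a window of n_lags scans."""
    s = np.zeros(n_scans)
    s[list(onsets)] = 1.0
    X = np.zeros((n_scans, n_lags))
    for j in range(n_lags):
        X[j:, j] = s[:n_scans - j]
    return X

X_fir = fir_design(onsets=[5, 30, 55], n_scans=120, n_lags=10)
```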

8 / 35

R1-GLM

(Figure: the unconstrained GLM coefficients $h_1, h_2, h_3, \ldots, h_{3k-2}, h_{3k-1}, h_{3k}$, three basis coefficients per event, which R1-GLM constrains to the rank-one structure $\operatorname{vec}(h\beta^\top)$.)
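The rank-one model reads $y \approx X\,\operatorname{vec}(h\beta^\top)$: one HRF $h \in \mathbb{R}^d$ shared by all $k$ events, one activation $\beta_i$ per event. Below is a simple alternating least-squares sketch of that idea; it is an illustrative stand-in, not the solver used in the thesis:

```python
import numpy as np

def r1_glm_als(X, y, d, k, n_iter=30):
    """min over (h, beta) of ||y - X vec(h beta^T)||^2, where X has k blocks
    of d columns (one block per event). Alternates two least-squares steps."""
    Xb = X.reshape(X.shape[0], k, d)   # n x k x d, one slab per event
    h = np.zeros(d)
    h[0] = 1.0                         # arbitrary initialization of the HRF
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(Xb @ h, y, rcond=None)  # h fixed
        Zh = np.einsum('nkd,k->nd', Xb, beta)              # beta fixed
        h, *_ = np.linalg.lstsq(Zh, y, rcond=None)
        nrm = np.linalg.norm(h)        # fix the scale ambiguity of (h, beta)
        h /= nrm
        beta *= nrm
    return h, beta

# Hypothetical usage: 200 scans, k = 4 events, d = 3 basis functions.
rng = np.random.RandomState(0)
X = rng.randn(200, 4 * 3)
h_true, b_true = np.array([0.2, 1.0, 0.4]), np.array([1.0, -0.5, 2.0, 0.3])
y = X @ np.outer(b_true, h_true).ravel()   # noiseless vec(h beta^T) signal
h_hat, b_hat = r1_glm_als(X, y, d=3, k=4)
```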

9 / 35

Validation of R1-GLM

  • Improvement using decoding and encoding models.

10 / 35

Results

Cross-validation score in two different datasets

S. Tom et al., “The neural basis of loss aversion in decision-making under risk,” Science 2007.

K. N. Kay et al., “Identifying natural images from human brain activity.,” Nature 2008.

Results - Encoding

  • Measure: voxel-wise encoding score, the correlation with the BOLD signal at each voxel on left-out data (sketch below).
  • R1-GLM (FIR basis) improves voxel-wise encoding score on more than 98% of the voxels.
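For reference, this score is a per-voxel Pearson correlation between measured and predicted BOLD on the left-out data; a minimal sketch, assuming time × voxels arrays:

```python
import numpy as np

def encoding_score(bold_true, bold_pred):
    """Correlation between measured and predicted BOLD, per voxel (column)."""
    a = bold_true - bold_true.mean(axis=0)
    b = bold_pred - bold_pred.mean(axis=0)
    return (a * b).sum(axis=0) / (np.linalg.norm(a, axis=0)
                                  * np.linalg.norm(b, axis=0))
```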

11 / 35

Conclusions

  • Model for the joint estimation of HRF and activation coefficients.
  • Constrained estimation of the HRF while remaining computationally tractable (~1h for full brain estimation).
  • Validated the model through the improvement in decoding and encoding scores obtained on two fMRI datasets.
  • Article and software distribution:
    • “Data-driven HRF estimation for encoding and decoding models,” Pedregosa, Eickenberg et al., Neuroimage, Jan. 2015.
    • hrf_estimation: http://pypi.python.org/pypi/hrf_estimation

12 / 35

Outline

13 / 35

Decoding with ordinal labels

(Figure: decoding setup: the activation coefficients $X$ are used to predict the target values $y$.)

Target values can be

  • Binary: $\{-1, 1\}$
  • Continuous: $\{0.34, 1, 3, \ldots\}$
  • Ordinal: $\{1, 2, 3, \ldots, k\}$, with $1 \prec 2 \prec 3 \prec \cdots$

How can we incorporate the order information into a decoding model?

14 / 35

Loss functions

Two relevant loss functions:

  • Absolute error: $\ell_A(y, \hat{y}) = |y - \hat{y}|$.
    • Measures the distance between the predicted and the true label.
    • Models: Ordinal Logistic Regression, Least Absolute Error, Cost-sensitive Multiclass.
  • Pairwise disagreement: $\ell_P(y_1, y_2, \hat{y}_1, \hat{y}_2) = 0$ if $\text{sign}(y_1 - y_2) = \text{sign}(\hat{y}_1 - \hat{y}_2)$, 1 otherwise.
    • Measures the agreement in ordering between $(y_1, y_2)$ and $(\hat{y}_1, \hat{y}_2)$.
    • Models: LogisticRank.

Two loss functions → two aspects of the ordinal problem.

15 / 35

Ordinal Logistic Regression

  • J. Rennie and N. Srebro, “Loss Functions for Preference Levels: Regression with Discrete Ordered Labels,” IJCAI 2005.

Generalization of logistic regression.

Parameters (p = num. dimensions, k = num. classes)

  • coefficients $w \in \mathbb{R}^p$
  • non-decreasing thresholds $\theta \in \mathbb{R}^{k-1}$

Prediction

$\hat{y} = 1 + \#\{i : w^\top x > \theta_i\}$

Surrogate loss function:

$\sum_{j=1}^{y-1} \text{logistic}(\theta_j - w^\top x) \;+\; \sum_{j=y}^{k-1} \text{logistic}(w^\top x - \theta_j)$
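Below is a sketch of fitting this surrogate with a generic optimizer. The toy data are invented, the convention $\text{logistic}(t) = \log(1 + e^t)$ is inferred from the formula above, and the thresholds are left unconstrained; these are simplifications of the example, not properties of the model:

```python
import numpy as np
from scipy.optimize import minimize

def logistic(t):
    return np.logaddexp(0, t)   # log(1 + e^t): vanishes as t -> -infinity

def at_surrogate(params, X, y, k):
    """sum_{j<y} logistic(theta_j - w.x) + sum_{j>=y} logistic(w.x - theta_j),
    averaged over samples; params stacks w (p entries) and theta (k-1)."""
    p = X.shape[1]
    w, theta = params[:p], params[p:]
    z = X @ w
    loss = 0.0
    for j in range(k - 1):      # 0-indexed j corresponds to threshold theta_{j+1}
        above = y > j + 1       # samples whose label lies above this threshold
        loss += np.where(above, logistic(theta[j] - z),
                                logistic(z - theta[j])).sum()
    return loss / len(y)

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = np.clip(np.round(X[:, 0] + 3), 1, 5).astype(int)  # toy ordinal labels 1..5
k, p = 5, X.shape[1]
x0 = np.concatenate([np.zeros(p), np.arange(k - 1, dtype=float)])
res = minimize(at_surrogate, x0, args=(X, y, k))
w, theta = res.x[:p], res.x[p:]
y_pred = 1 + ((X @ w)[:, None] > theta).sum(axis=1)   # 1 + #{i : w.x > theta_i}
```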

16 / 35

Pairwise ranking

  • R. Herbrich et al., “Support Vector Learning for Ordinal Regression,” 1999.
  • T. Joachims, “Optimizing Search Engines using Clickthrough Data,” SIGKDD 2002.

Parameters (p = num. dimensions, k = num. classes)

  • coefficients $w \in \mathbb{R}^p$

Prediction: $\text{sign}(w^\top x_i - w^\top x_j) = \text{sign}(w^\top (x_i - x_j))$

Surrogate loss function

$\text{logistic}\big(\text{sign}(y_i - y_j)\, w^\top (x_i - x_j)\big)$

can be implemented as a binary-class logistic regression model with $\hat{y}_{ij} = \text{sign}(y_i - y_j)$ and $\hat{x}_{ij} = x_i - x_j$.
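Spelling out that reduction (toy data; scikit-learn's LogisticRegression stands in for any binary logistic solver; the intercept is dropped because the pairwise problem is symmetric through the origin):

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def pairwise_transform(X, y):
    """Build x_ij = x_i - x_j labeled by sign(y_i - y_j), skipping ties."""
    Xp, yp = [], []
    for i, j in combinations(range(len(y)), 2):
        if y[i] != y[j]:
            Xp.append(X[i] - X[j])
            yp.append(np.sign(y[i] - y[j]))
    return np.array(Xp), np.array(yp)

rng = np.random.RandomState(0)
X = rng.randn(60, 5)
y = np.clip(np.round(X[:, 0] + 3), 1, 5)   # toy ordinal labels

Xp, yp = pairwise_transform(X, y)
clf = LogisticRegression(fit_intercept=False).fit(Xp, yp)
scores = X @ clf.coef_.ravel()             # w.x orders the original samples
```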

17 / 35

Two fMRI datasets, three tasks

1) Predict the complexity level of a sentence (6 levels)

E. Cauvet, “Traitement des structures syntaxiques dans le langage et dans la musique” (Processing of syntactic structures in language and music), PhD thesis.

2) Predict number of letters in a word.

3) Predict the real size of the object denoted by a noun.

V. Borghesani, “A perceptual-to-conceptual gradient of word coding along the ventral path,” PRNI 2014.

18 / 35

Results - Mean Absolute Error

Top performers:

  • Ordinal Logistic
  • Least Absolute Error.

19 / 35

Results - Pairwise Disagreement

Top performer: RankLogistic

Application: characterizing the role of language ROIs in sentence complexity.

F. Pedregosa et al., “Learning to rank from medical imaging data,” MLMI 2012.

20 / 35

Conclusions

  • Two loss functions for the task of decoding with ordinal values.
  • Used 4 different models to exploit order information in the context of fMRI-based brain decoding.
  • Benchmarked the methods on 3 tasks (2 datasets).
  • First use of pairwise ranking in the context of fMRI; article and software distribution:
    • “Learning to rank from medical imaging data,” Pedregosa et al., MLMI 2012.
    • PySofia: https://pypi.python.org/pypi/pysofia

21 / 35

Outline

22 / 35

Ordinal Regression surrogates

  • IT) $\text{logistic}(f(X) - \theta_{y-1}) + \text{logistic}(\theta_y - f(X))$
  • AT) $\sum_{i=1}^{y-1} \text{logistic}(f(X) - \theta_i) + \sum_{i=y}^{k-1} \text{logistic}(\theta_i - f(X))$
  • J. Rennie and N. Srebro, “Loss Functions for Preference Levels: Regression with Discrete Ordered Labels,” IJCAI 2005.

23 / 35

Empirical comparison

  • [Rennie & Srebro 2005]
  • [Chu & Keerthi 2005]
  • [Lin & Li 2006]

Is there a reason that can explain this behaviour?

(Plot: Mean Absolute Error of the IT surrogate vs. the AT surrogate across datasets; both axes range from 0.40 to 0.80.)

24 / 35

Fisher consistency

  • Goal of supervised learning: minimize the risk $R(h) = \mathbb{E}[\ell(Y, h(X))]$.
  • Minimizing $R$ directly is computationally intractable, so we instead minimize a surrogate risk $R_\psi(h) = \mathbb{E}[\psi(Y, h(X))]$.

Fisher consistency:

  • Relates the minimizer of $R_\psi$ to the minimizer of $R$.
  • $f^* \in \operatorname{argmin}_{f \in \mathcal{F}} R_\psi(f) \implies R(f^*) = R(h^*)$, where $h^*$ is a minimizer of the risk (the Bayes rule).
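For intuition, a classical binary-classification example (a standard result, not specific to this work): the logistic surrogate is Fisher-consistent with respect to the 0-1 loss.

```latex
\eta(x) = P(Y{=}1 \mid X{=}x), \qquad
h^*(x) = \operatorname{sign}\bigl(2\eta(x) - 1\bigr)
\quad \text{(Bayes rule for the 0--1 loss)}

f^* = \operatorname*{argmin}_{f} \, \mathbb{E}\bigl[\log(1 + e^{-Y f(X)})\bigr]
\;\Longrightarrow\;
f^*(x) = \log \frac{\eta(x)}{1 - \eta(x)}

\operatorname{sign}\bigl(f^*(x)\bigr) = h^*(x)
\;\Longrightarrow\;
R(f^*) = R(h^*)
```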

25 / 35

Illustration

Binary-class linear problem on $\mathbb{R}^2$:

  • a) $\sum_{i=1}^{n} \ell(y_i, h(X_i))$ (the 0-1 loss as a function of $w$)
  • b) $\sum_{i=1}^{n} \text{logistic}(y_i X_i w)$ (the logistic surrogate as a function of $w$)

26 / 35

Fisher consistency

Previous work

  • Binary classification
    • P. Bartlett et al., “Convexity, Classification, and Risk Bounds,” J. Am. Stat. Assoc. 2003.
  • Pairwise ranking
    • J. Duchi et al., “On the Consistency of Ranking Algorithms,” ICML 2010.
  • Ordinal Regression
    • Pedregosa et al., “On the Consistency of Ordinal Regression Methods,” ArXiv preprint, Nov. 2014.

27 / 35

Threshold-based surrogates

Propose a formulation that parametrizes the ordinal regression surrogates seen so far

$\psi(y, \alpha) = -\sum_{i=1}^{y-1} \Delta(y, i)\, \phi(\alpha_i) + \sum_{i=y}^{k-1} \Delta(y, i)\, \phi(-\alpha_i)$

where

  • $\Delta(y, i) = \ell(y, i+1) - \ell(y, i)$
  • ϕ:RR is a binary-class surrogate loss function (hinge, logistic, etc.).

  • For $\ell$ = the zero-one loss, $\psi$ = Immediate Threshold (IT).
  • For $\ell$ = the absolute error, $\psi$ = All Threshold (AT) (see the sketch below).
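A small numerical sketch of this parametrization, assuming $\alpha_i = f(X) - \theta_i$ and taking the logistic loss for $\phi$; plugging in the zero-one loss recovers IT and the absolute error recovers AT:

```python
import numpy as np

def psi(y, alpha, ell, k, phi=lambda t: np.logaddexp(0, -t)):
    """psi(y, alpha) = -sum_{i<y} Delta(y,i) phi(alpha_i)
                       + sum_{i>=y} Delta(y,i) phi(-alpha_i),
    with Delta(y,i) = ell(y, i+1) - ell(y, i); phi is the logistic loss."""
    total = 0.0
    for i in range(1, k):                  # threshold indices i = 1..k-1
        delta = ell(y, i + 1) - ell(y, i)
        if i < y:
            total -= delta * phi(alpha[i - 1])
        else:
            total += delta * phi(-alpha[i - 1])
    return total

zero_one = lambda y, i: float(y != i)      # gives the Immediate Threshold (IT)
abs_err = lambda y, i: abs(y - i)          # gives the All Threshold (AT)

alpha = np.array([1.2, 0.1, -0.8])         # alpha_i = f(X) - theta_i, k = 4
it_loss = psi(2, alpha, zero_one, k=4)     # = phi(alpha_1) + phi(-alpha_2)
at_loss = psi(2, alpha, abs_err, k=4)      # = phi(alpha_1) + sum_{i>=2} phi(-alpha_i)
```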

28 / 35

Consistency of ψ

$\psi(y, \alpha) = -\sum_{i=1}^{y-1} \Delta(y, i)\, \phi(\alpha_i) + \sum_{i=y}^{k-1} \Delta(y, i)\, \phi(-\alpha_i)$

  • Main result: if $\phi: \mathbb{R} \to \mathbb{R}_+$ is consistent with respect to the zero-one binary loss, i.e., $\phi$ is differentiable at zero and $\phi'(0) < 0$, then the surrogate $\psi$ is consistent with respect to $\ell$. In particular:

  • IT) $\phi(f(X) - \theta_{y-1}) + \phi(\theta_y - f(X))$ is consistent w.r.t. the 0-1 loss.
  • AT) $\sum_{i=1}^{y-1} \phi(f(X) - \theta_i) + \sum_{i=y}^{k-1} \phi(\theta_i - f(X))$ is consistent w.r.t. the absolute error.

29 / 35

Illustration

  • Hinge loss
  • 8 Datasets
  • AT performs better w.r.t. Absolute Error.
  • IT performs better w.r.t. Zero-One Error.

30 / 35

Other results

Consistency

  • Absolute error surrogate: $\psi_A(y, \alpha) = |y - \alpha|$
    • Consistent w.r.t. Absolute Error.
    • Extends results of [H. Ramaswamy 2012].
  • Squared error surrogate: $\psi_S(y, \alpha) = (y - \alpha)^2$
    • Consistent w.r.t. Squared Error.
  • Proportional odds [McCullagh 1980]: $P(y \leq i \mid X) = \sigma(\theta_i - Xw)$
    • Consistent w.r.t. Absolute Error.

31 / 35

Contributions

32 / 35

Perspectives

Computational cost of R1-GLM:

$\operatorname{argmin}_{h, \beta} \; \|y - X \operatorname{vec}(h \beta^\top)\|^2$

  • Common use case: > 50,000 voxels.
  • Unused structure:

    • Repeated design ($X$).
    • Spatial smoothness.

  • More realistic noise model.

33 / 35

Perspectives

Consistency in a constrained space.

  • Current proof implicitly assumes decision functions can be separately defined for each sample.
  • Threshold-based decision functions are of the form $(\theta_1 - f(X),\, \theta_2 - f(X),\, \ldots,\, \theta_{k-1} - f(X))$.
  • θ is shared among all samples.
  • Is it possible to obtain consistency results in this setting?

34 / 35

Thanks for your attention

35 / 35

Backup slides

The Linear-Time-Invariant assumption

The model of the BOLD signal that is commonly used assumes a linear-time-invariant relationship between the BOLD signal and the neural response.

Source: [Poldrack et al. 2011]

  • Homogeneity: If a neural response is scaled by a factor of a, then the BOLD response is also scaled by this same factor of a.
  • Additivity: The response for two separate events is the sum of the independent signals.
  • Time invariance: if a stimulus is shifted by t seconds, the BOLD response is shifted by the same amount.
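These three properties can be verified directly on a toy convolution model of the BOLD signal (the HRF values below are invented):

```python
import numpy as np

h = np.array([0., .2, .8, 1., .7, .3, .1, -.1, -.05, 0.])  # toy HRF
s1, s2 = np.zeros(50), np.zeros(50)
s1[5], s2[20] = 1., 1.                       # two single-event stimuli

bold = lambda s: np.convolve(s, h)[:50]      # BOLD = stimulus * HRF

# Homogeneity: scaling the neural response scales the BOLD response.
assert np.allclose(bold(3 * s1), 3 * bold(s1))
# Additivity: the response to two events is the sum of each response.
assert np.allclose(bold(s1 + s2), bold(s1) + bold(s2))
# Time invariance: shifting the stimulus shifts the response.
assert np.allclose(bold(np.roll(s1, 7))[7:], bold(s1)[:-7])
```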

HRF variability

  • Across age: Matthew T. Colonnese et al., “Development of hemodynamic responses and functional connectivity in rat somatosensory cortex,” Nature Neuroscience 2007.

Extension to separate designs

  • GLMS: Extension of the classical GLM proposed by [Mumford et al. 2012] that improves estimation in highly correlated designs.

R1-GLM with separate designs: R1-GLMS

Optimization problem of the form

$$\sum_{i=1}^{k} \left\| y - \beta_i X_0^{S_i} B h - r_i X_1^{S_i} B h \right\|^2$$

subject to $\|Bh\| = 1$ and $\langle h, h_{\text{ref}} \rangle > 0$.

Source: [Mumford et al. 2012]

Results - Encoding

Results - Decoding

Absolute Error

Models: Least Absolute Error, Ordinal Logistic Regression, Multiclass Classification

Least Absolute Error

Model parameters: $w \in \mathbb{R}^p$, $b \in \mathbb{R}$

Decision function: $f(x_i) = b + \langle x_i, w \rangle$

Prediction: $\hat{y}_i = \text{round}_{1 \leq j \leq k}(f(x_i))$, i.e., $f(x_i)$ rounded to the closest label.

Estimation: $w^*, b^* = \operatorname{argmin}_{w, b} \frac{1}{n} \sum_{i=1}^{n} |y_i - f(x_i)| + \lambda \|w\|^2$

Regression model in which the prediction is given by rounding to the closest label.
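A sketch of this model with a generic solver (toy data; a derivative-free method is used because the absolute loss is non-smooth, though any L1-regression solver would serve):

```python
import numpy as np
from scipy.optimize import minimize

def lae_objective(params, X, y, lam):
    """(1/n) sum_i |y_i - (b + <x_i, w>)| + lam * ||w||^2."""
    w, b = params[:-1], params[-1]
    return np.abs(y - (X @ w + b)).mean() + lam * w @ w

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = np.clip(np.round(X @ rng.randn(5) + 3), 1, 5)   # toy labels in {1,...,5}

res = minimize(lae_objective, np.zeros(X.shape[1] + 1),
               args=(X, y, 0.1), method="Powell")
w, b = res.x[:-1], res.x[-1]
y_pred = np.clip(np.round(X @ w + b), 1, 5)         # round to the closest label
```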