Given $n$ training pairs $\{(X_1, y_1), \ldots, (X_n, y_n)\}$, $X_i \in \RR^2, y_i \in \{-1, +1\}$
Learn a hyperplane, given by its normal vector $w \in \RR^2$, that separates both classes: $\text{sign}(w^T X_i) = y_i$
A decision function that separates both classes might not exist.
The next best thing $\implies$ minimize the number of samples where $\text{sign}(w^T X_i) \neq y_i \iff y_i w^T X_i < 0$
$$\text{argmin}_{w \in \RR^2} \frac{1}{n} \sum_{i=1}^n \ell(y_i w^T X_i)$$
where $\ell(t) = \mathbf{1}\{t < 0\}$ is the zero-one loss.
Minimization of this cost function:
Not amenable to gradient-based methods: the zero-one loss is piecewise constant, so its gradient is zero almost everywhere.
Known to be NP-hard!
Feldman, Vitaly, et al. "Agnostic learning of monomials by halfspaces is hard." SIAM Journal on Computing (2012)
Greedy algorithms
(Convex) relaxations
Nguyen, Tan, and Scott Sanner. "Algorithms for direct 0--1 loss optimization in binary classification." ICML 2013.
SVM, Logistic Regression, Boosting, etc.
Replace the zero-one loss by a convex upper bound, e.g. the hinge loss $\psi(t) = \max(0, 1-t)$ (SVM), the logistic loss $\psi(t) = \log_2(1 + e^{-t})$ (logistic regression), or the exponential loss $\psi(t) = e^{-t}$ (boosting):
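For concreteness, a minimal NumPy sketch of the zero-one loss and these three surrogates, checking the upper-bound property numerically (function names are my own, not from any library):

```python
import numpy as np

# t = y * w^T x, the signed margin of a sample.
def zero_one(t):
    return (t < 0).astype(float)        # 1 if misclassified, else 0

def hinge(t):                           # SVM
    return np.maximum(0.0, 1.0 - t)

def logistic(t):                        # logistic regression
    return np.log2(1.0 + np.exp(-t))    # base 2 makes it upper-bound the 0-1 loss

def exponential(t):                     # boosting
    return np.exp(-t)

margins = np.linspace(-2, 2, 9)
for psi in (hinge, logistic, exponential):
    # each surrogate upper-bounds the zero-one loss pointwise
    assert np.all(psi(margins) >= zero_one(margins))
```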
Notation: $\mathcal{X}$ = sample space, $\mathcal{Y}$ = target values, $P$ = probability distribution on $\mathcal{X} \times \mathcal{Y}$
Population setting: suppose we have access to the distribution $P$ that generates the sample $\{(X_1, y_1), \ldots, (X_n, y_n)\}$
Fisher consistency (classification calibration): the minimizer $f^*$ of the surrogate risk $\mathbb{E}[\psi(y f(X))]$ yields the Bayes-optimal prediction, i.e. $\text{sign}(f^*(X)) = \text{sign}(2P(y=1|X) - 1)$.
Theorem: Let $\psi$ be convex. Then $\psi$ is classification-calibrated if and only if it is differentiable at $0$ and $\psi'(0) < 0$
Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2006). Convexity, Classification, and Risk Bounds. Journal of the American Statistical Association.
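A quick worked check of the theorem on two losses (my own examples, not from the slides):
$$\psi(t) = \max(0, 1-t) \ \text{(hinge)}: \text{ convex, differentiable at } 0, \ \psi'(0) = -1 < 0 \implies \text{calibrated}$$
$$\psi(t) = \max(0, -t) \ \text{(perceptron)}: \text{ convex but not differentiable at } 0 \implies \text{not calibrated}$$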
"... most of popular loss functions for binary classification are Fisher-consistent"
Zou, H., Zhu, J., & Hastie, T. (2008). New multicategory boosting algorithms based on multicategory Fisher-consistent losses. The Annals of Applied Statistics
Different approaches for multiclass SVM:
One-vs-one $\implies$ consistent
One-vs-rest $\implies$ inconsistent
Liu, Y. (2007). Fisher Consistency of Multicategory Support Vector Machines. In NIPS 2007 (pp. 1–8).
Concept of non-dominant distribution:
$P$ is called non-dominant if $P(y|X) < \frac{1}{2}$ for all $y \in \mathcal{Y}$, i.e. no class holds a majority; e.g. with three classes, conditional probabilities $(0.4, 0.35, 0.25)$ are non-dominant.
Setting: same as multiclass classification, but labels carry order information.
How to include the order information within the model?
Probabilistic point of view:
$P(y \leq i \mid X) = \text{sigmoid}(\theta_i - f(X))$, with ordered thresholds $\theta_1 \leq \cdots \leq \theta_{k-1} \quad$ [McCullagh 1980]
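Class probabilities follow by differencing the cumulative ones (standard for cumulative link models, with the conventions $P(y \le 0 \mid X) = 0$ and $P(y \le k \mid X) = 1$):
$$P(y = i \mid X) = P(y \le i \mid X) - P(y \le i-1 \mid X) = \text{sigmoid}(\theta_i - f(X)) - \text{sigmoid}(\theta_{i-1} - f(X))$$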
Optimization point of view:
Minimize a convex surrogate of $\ell^{\text{OR}}(y, \hat{y}) = |y - \hat{y}|$ (absolute error)
Learn: $f: \mathcal{X} \to \RR^{k-1}$ ($k$ = num. of classes)
Prediction: $\hat{y} = 1 + \sum_{i=1}^{k-1} \ell(f(X)_i)$, with $\ell$ the zero-one loss from before (count of negative components)
$\implies \ell^{\text{OR}}(y, f(X)) = \left|y - 1 - \sum_{i=1}^{k-1} \ell(f(X)_i)\right|$
$= \sum_{i=1}^{y-1} \ell(-f(X)_i) + \sum_{i=y}^{k-1} \ell(f(X)_i)$ (the last equality holds when the sign pattern of $f(X)$ is monotone)
A natural surrogate is given by
All Thresholds (AT) $\sum_{i=1}^{y-1} \psi(-f(X)_i) + \sum_{i=y}^{k-1} \psi(f(X)_i)$
But Immediate Thresholds (IT), $\psi(-f(X)_{y-1}) + \psi(f(X)_y)$, has also been proposed (with a single term at the extremes $y = 1$ and $y = k$).
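A minimal NumPy sketch of AT, IT, and the thresholded prediction rule above, under the conventions of these slides (names are my own):

```python
import numpy as np

def psi(t):                              # logistic surrogate of the 0-1 loss
    return np.log(1.0 + np.exp(-t))

def predict(f):
    # prediction rule: 1 + number of negative components of f(X)
    return 1 + int(np.sum(f < 0))

def all_thresholds(y, f):
    # AT: every threshold contributes a term
    return sum(psi(-f[i]) for i in range(y - 1)) + \
           sum(psi(f[i]) for i in range(y - 1, len(f)))

def immediate_thresholds(y, f):
    # IT: only the thresholds adjacent to y contribute
    terms = []
    if y >= 2:
        terms.append(psi(-f[y - 2]))     # threshold y-1, 0-indexed
    if y <= len(f):
        terms.append(psi(f[y - 1]))      # threshold y
    return sum(terms)

f = np.array([-2.0, -0.5, 1.0])          # k - 1 = 3 thresholds, k = 4 classes
print(predict(f))                        # 3: two negative components
print(all_thresholds(3, f), immediate_thresholds(3, f))
```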
[Figure: mean absolute error of the AT and IT surrogates on benchmark datasets]
According to these studies, one of the two families of surrogates seems to perform better on the datasets they tried. Is there a reason that can explain this behaviour? Question: should we conclude, because it performed better on some datasets, that this family is superior in general?
The results we give in this section provide theoretical insight into this question.
We propose a formulation that parametrizes both surrogates:
$$\Psi^{\ell}(y, f(X)) = \sum_{i = 1}^{y-1} \Delta \ell(y, i)\,\psi(-f(X)_i) - \sum_{i=y}^{k-1}\Delta \ell(y, i)\,\psi(f(X)_i)$$
where $\Delta \ell(y, i) = \ell(y, i) - \ell(y, i+1)$.
For $\ell(y, i) = |y - i|$ (absolute error) this recovers AT; for $\ell(y, i) = \mathbf{1}\{y \neq i\}$ (zero-one) it recovers IT.
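A sketch of the parametrized surrogate under the difference definition of $\Delta\ell$ above (my reading of the formula); on the example below it agrees with AT for the absolute error and with IT for the zero-one loss:

```python
import numpy as np

def psi(t):
    return np.log(1.0 + np.exp(-t))

def Psi(y, f, ell):
    # ell(y, i): the target ordinal loss, defined for labels i = 1, ..., k
    k = len(f) + 1
    delta = lambda i: ell(y, i) - ell(y, i + 1)   # discrete difference
    left = sum(delta(i) * psi(-f[i - 1]) for i in range(1, y))
    right = -sum(delta(i) * psi(f[i - 1]) for i in range(y, k))
    return left + right

abs_err = lambda y, i: abs(y - i)
zero_one = lambda y, i: float(y != i)

f = np.array([-2.0, -0.5, 1.0])
# Psi with absolute error equals the AT surrogate,
# Psi with the zero-one loss equals the IT surrogate (cf. previous sketch).
print(Psi(3, f, abs_err), Psi(3, f, zero_one))
```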
Motivation: search results ranking.
Information in clickthrough data:
$X_i$ = $i$-th search result
e.g. the user clicks results 3 and 4 but skips result 2 $\implies X_3 > X_2$ and $X_4 > X_2$
"... clickthrough data does not convey absolute relevance judgments, but partial relative relevance judgments ...", Joachims, T. (2002). Optimizing Search Engines using Clickthrough Data. ACM SIGKDD.
Goal
Given: partial relevance judgements
Learn a function $f: \mathcal{X} \to \RR$ such that $f(X_i) > f(X_j)$ if $X_i > X_j$.
$\implies$ sort the webpages such that $f(X_a) \geq f(X_b) \geq f(X_c) \ldots$
Pairwise disagreement loss: given a directed acyclic graph $G$ whose edges $i \to j$ encode $X_i > X_j$, $$\ell(G, f(X)) = \sum_{i \to j \in G} \ell(f(X_i) - f(X_j)) $$ where $\ell$ is the zero-one loss.
Surrogate:$$\psi(G, f(X)) = \sum_{i \to j \in G} \phi(f(X_i) - f(X_j)) $$
where $\phi$ is a convex surrogate of the zero-one loss. For $\phi$ = hinge loss, this is known as RankSVM.
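A minimal sketch of both quantities for a preference DAG given as a list of edges, with the hinge $\phi$ (so the surrogate is the RankSVM loss); the data is the clickthrough example from above, names are my own:

```python
import numpy as np

def phi(t):                                  # hinge surrogate of the 0-1 loss
    return np.maximum(0.0, 1.0 - t)

def pairwise_disagreement(edges, scores):
    # number of preference pairs that the scores order the wrong way
    return sum(float(scores[i] - scores[j] < 0) for i, j in edges)

def pairwise_surrogate(edges, scores):
    # convex relaxation: penalize small or negative score differences
    return sum(phi(scores[i] - scores[j]) for i, j in edges)

edges = [(3, 2), (4, 2)]                     # i -> j in the DAG: X_i > X_j
scores = {2: 0.1, 3: 0.7, 4: -0.3}           # f(X_i) for the relevant results
print(pairwise_disagreement(edges, scores))  # 1.0: the pair (4, 2) is misordered
print(pairwise_surrogate(edges, scores))     # phi(0.6) + phi(-0.4) = 0.4 + 1.4
```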
Objective: $$ \argmin_f \mathbb{E}_{X, G}[\ell(G, f)]$$
Result: commonly used surrogates (e.g. RankSVM) are inconsistent
Theorem: if a surrogate of the pairwise disagreement loss is consistent, then it is non-convex.
Duchi, J. C., Mackey, L. W., & Jordan, M. I. (2010). On the Consistency of Ranking Algorithms. In Proceedings of the 27th International Conference on Machine Learning.
A direct proof, extended to other ranking loss functions: Calauzènes, C., Usunier, N., & Gallinari, P. (2012). On the (Non-)existence of Convex, Calibrated Surrogate Losses for Ranking. NIPS 2012.
Weaker notion: $\mathcal{F}$-consistency, where the surrogate risk is minimized over a restricted function class $\mathcal{F}$ rather than all measurable functions; $\mathcal{F}$ can be e.g. the space of smooth functions.
Result: given $\mathcal{F}$ "regular" (Im($\mathcal{F}$) = $\RR$): $\mathcal{F}$-consistency $\implies$ consistency
Shi, Q., Reid, M., Caetano, T., & van den Hengel, A. (2014). A Hybrid Loss for Multiclass and Structured Prediction. IEEE TPAMI.
Co-authors
Francis Bach
Alexandre Gramfort
Models: Least Absolute Error, Ordinal Logistic Regression, Multiclass Classification
Least Absolute Error
Model parameters: $\mathbf{w} \in \RR^p, b \in \RR$
Decision function: $f(\mathbf{x}_i) = b + \langle \mathbf{x}_i, \mathbf{w} \rangle$
Prediction: $\text{round}(f(\mathbf{x}))$, clipped to $\{1, \ldots, k\}$
Estimation: $\mathbf{w}^*, b^* \in \argmin_{\mathbf{w}, b} \frac{1}{n}\sum_{i=1}^n |y_i - f(\mathbf{x}_i)| + \lambda \|\mathbf{w}\|^2$
Regression model in which the prediction is given by rounding to the closest label.
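A minimal sketch of the estimation step via subgradient descent (the optimizer is my choice; the slides do not specify one):

```python
import numpy as np

def fit_lae(X, y, lam=0.1, lr=0.01, n_iter=1000):
    """Minimize (1/n) * sum_i |y_i - (b + <x_i, w>)| + lam * ||w||^2."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        r = y - (X @ w + b)              # residuals
        s = -np.sign(r)                  # subgradient of |r_i| w.r.t. prediction
        w -= lr * (X.T @ s / n + 2 * lam * w)
        b -= lr * np.mean(s)
        # (a decaying step size would be needed for exact convergence)
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.clip(np.round(X @ np.array([1.0, -0.5, 0.0]) + 3), 1, 5)  # labels in 1..5
w, b = fit_lae(X, y)
y_hat = np.clip(np.round(X @ w + b), 1, 5)   # prediction: round to closest label
```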