On the consistency of ordinal regression methods

Category: misc
#consistency #machine learning

My latest work (with Francis Bach and Alexandre Gramfort) is on the consistency of ordinal regression methods. It has the wildly imaginative title of "On the Consistency of Ordinal Regression Methods" and is currently under review, but you can read the draft on arXiv. If you have any thoughts about it, please leave me a comment!

Update July 2017: this paper was published in the Journal of Machine Learning Research. The published version can be found here.

Ordinal what?

The problem of ordinal regression is an old one in supervised learning. Its roots can be traced back to the work of McCullagh[1] in the 1980s. It is a supervised learning problem that shares properties with both multiclass classification and regression, yet is fundamentally different from each. It can be seen as the problem of predicting a target variable from labeled observations, where the target takes values in a discrete and ordered set of labels. As in the multiclass classification setting, the target variable is discrete, and as in the regression setting (but unlike the multiclass setting) there is a meaningful order between the classes.

The most popular examples of ordinal regression arise when the target variable is a human-generated rating. For example, in a movie recommendation system the target variable can take the values “do-not-bother” ≺ “only-if-you-must” ≺ “good” ≺ “very good” ≺ “run-to-see”. Using multiclass classification to predict this target would yield a suboptimal classifier, since it ignores the natural ordering between the labels. On the other hand, a regression algorithm assumes that the target variable is continuous, while here it is clearly discrete. Ordinal regression is the appropriate model for this kind of target variable.
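To make the difference concrete, here is a tiny sketch (my own toy example, not from the paper) comparing the zero-one loss, which treats every mistake the same way, with an order-aware loss such as the absolute error on the label indices:

```python
# A minimal sketch: zero-one loss vs. an order-aware loss for the movie-rating
# example. Labels are encoded as 0 ("do-not-bother") ... 4 ("run-to-see").
labels = ["do-not-bother", "only-if-you-must", "good", "very good", "run-to-see"]

def zero_one(y_true, y_pred):
    """Multiclass loss: every mistake costs the same."""
    return 0 if y_true == y_pred else 1

def absolute_error(y_true, y_pred):
    """Ordinal loss: mistakes far from the true label cost more."""
    return abs(y_true - y_pred)

y_true = 4  # "run-to-see"
for y_pred in (3, 0):
    print(labels[y_pred],
          "zero-one:", zero_one(y_true, y_pred),
          "absolute:", absolute_error(y_true, y_pred))
# Predicting "very good" and "do-not-bother" are equally bad under the
# zero-one loss, but the absolute error captures that the latter is a much
# worse mistake.
```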

Fisher consistency

The notion of Fisher consistency is also an old one in statistics, and goes back to the work of Fisher at the beginning of the 20th century. The rigorous definition is stated in the paper, so here I'll just give an intuition.

In supervised learning, we observe random samples (in the form of (target, sample) pairs, usually denoted $(y_i, X_i)$) from a population (let's call it P) and build a model that predicts the target when it sees a new sample. Fisher consistency can be seen as a sanity check on the learning model: it states that if, instead of seeing a random sample, we had access to the full population P (which in real life never happens), then our classifier would achieve an accuracy as good as the best possible accuracy (such a classifier is usually called the Bayes rule or Bayes predictor).
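As a toy illustration of this idea (my own; the precise definitions are in the paper), suppose that for a given x we knew the conditional probabilities P(y = j | x) exactly, i.e. we had access to the full population. The Bayes predictor then simply picks, for each x, the label with the smallest conditional risk, and that choice depends on the loss:

```python
import numpy as np

# Assumed conditional distribution P(y = j | x) for one fixed x (toy numbers).
eta = np.array([0.1, 0.3, 0.15, 0.25, 0.2])
k = len(eta)

# Conditional risk of predicting label i, for the zero-one and absolute error losses.
risk_01 = np.array([1 - eta[i] for i in range(k)])
risk_abs = np.array([np.sum(eta * np.abs(np.arange(k) - i)) for i in range(k)])

print("Bayes prediction, zero-one loss:", np.argmin(risk_01))    # the mode of eta
print("Bayes prediction, absolute error:", np.argmin(risk_abs))  # a median of eta
# A Fisher consistent method, trained on the full population, would attain
# exactly this smallest possible risk.
```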

Fisher consistency is an important property that "allows us to design good loss functions with desirable properties"[7]. Because of this, the Fisher consistency of the most widely used supervised learning methods has been investigated over the last decade. It has been shown (see e.g.[2]) that most common methods for binary classification are consistent. For the multiclass and ranking cases the situation is more interesting, with some methods known to be inconsistent, such as the one-vs-all SVM in multiclass classification and RankSVM in ranking.

To the best of my knowledge, this is the first study of Fisher consistency for ordinal regression methods. It proves that, despite the negative results for multiclass classification[3][4] and ranking[5][6], common ordinal regression methods are Fisher consistent. In this respect, ordinal regression is closer to binary classification than to multiclass classification, and in fact some results in the paper can be seen as generalizations of known results for binary classification.


In the paper we study the Fisher consistency of some popular ordinal regression methods. The surrogates that we analyze are the following (see Table 1 in the paper for a definition): all threshold, cumulative link, immediate threshold and least absolute deviation. The paper presents consistency results for each of these surrogates; the sketch below gives the flavor of one of them, the all-threshold loss.
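Here is a minimal sketch of an all-threshold style loss with the hinge as the underlying binary surrogate. The sign conventions and names are mine; Table 1 in the paper gives the exact definitions of all four surrogates.

```python
import numpy as np

def hinge(t):
    return np.maximum(1 - t, 0)

def all_threshold_loss(y, alpha, theta):
    """All-threshold style surrogate for label y (in 1..k), score alpha and
    non-decreasing thresholds theta = (theta_1, ..., theta_{k-1}).

    Each threshold theta_i defines the binary problem "is y > i?", and the
    surrogate sums the binary hinge losses of all these problems.
    """
    k_minus_1 = len(theta)
    signs = np.where(np.arange(1, k_minus_1 + 1) < y, 1.0, -1.0)  # +1 iff y > i
    return np.sum(hinge(signs * (alpha - theta)))

theta = np.array([-1.0, 0.0, 1.0, 2.0])               # k = 5 classes
print(all_threshold_loss(5, alpha=3.5, theta=theta))  # above all thresholds: 0
print(all_threshold_loss(5, alpha=0.0, theta=theta))  # violates several thresholds
```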

  1. McCullagh, Peter. "Regression models for ordinal data." Journal of the Royal Statistical Society. Series B (Methodological) (1980). 

  2. Bartlett, Peter L., Michael I. Jordan, and Jon D. McAuliffe. "Convexity, classification, and risk bounds." Journal of the American Statistical Association (2006). 

  3. Tewari, Ambuj, and Peter L. Bartlett. "On the consistency of multiclass classification methods." The Journal of Machine Learning Research 8 (2007). 

  4. Zhang, Tong. "Statistical analysis of some multi-category large margin classification methods." The Journal of Machine Learning Research 5 (2004). 

  5. Duchi, John C., Lester W. Mackey, and Michael I. Jordan. "On the consistency of ranking algorithms." Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010. 

  6. Calauzenes, Clément, Nicolas Usunier, and Patrick Gallinari. "On the (non-) existence of convex, calibrated surrogate losses for ranking." Advances in Neural Information Processing Systems. 2012. 

  7. Zhang, Tong. "Statistical behavior and consistency of classification methods based on convex risk minimization." Annals of Statistics (2004) 

  8. Ramaswamy, Harish G., and Shivani Agarwal. "Classification calibration dimension for general multiclass losses." Advances in Neural Information Processing Systems. 2012. 

Surrogate Loss Functions in Machine Learning

Category: misc
#machine learning #consistency #calibration

TL;DR: These are some notes on the calibration of surrogate loss functions in the context of machine learning. But mostly it is an excuse to post some images I made.

In the binary classification setting we are given $n$ training samples $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where $X_i$ belongs to some sample space $\mathcal{X}$, usually $\mathbb{R}^p$, but for the purpose of this post we can keep it abstract, and $Y_i \in \{-1, 1\}$ is an integer representing the class label.

We are also given a loss function $\ell: \{-1, 1\} \times \{-1, 1\} \to \mathbb{R}$ that measures the error of a given prediction. The value of the loss function $\ell$ at an arbitrary point $(y, \hat{y})$ is interpreted as the cost incurred by predicting $\hat{y}$ when the true label is $y$. In classification this function is often the zero-one loss, that is, $\ell(y, \hat{y})$ is zero when $y = \hat{y}$ and one otherwise.

The goal is to find a function $h: \mathcal{X} \to \{-1, 1\}$, the classifier, with the smallest expected loss on a new sample. In other words, we seek a function $h$ that minimizes the expected $\ell$-risk, given by $$ \mathcal{R}_{\ell}(h) = \mathbb{E}_{X \times Y}[\ell(Y, h(X))] $$

In theory, we could directly minimize the $\ell$-risk and we would obtain the optimal classifier, also known as the Bayes predictor. However, there are several problems with this approach. One is that the probability distribution of $X \times Y$ is unknown, so computing the exact expected value is not feasible; it must be approximated by the empirical risk. Another issue is that this quantity is difficult to optimize because the function $\ell$ is discontinuous. Take for example a problem in which $\mathcal{X} = \mathbb{R}^2$ and we seek the linear classifier $f(X) = \text{sign}(X w)$, $w \in \mathbb{R}^2$, that minimizes the $\ell$-risk. As a function of the parameter $w$, this risk looks something like

[Figure: the zero-one loss as a function of the parameter w]
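A plot of this kind can be reproduced in a few lines of code; the snippet below uses toy data of my own rather than the data behind the original figure.

```python
import numpy as np
import matplotlib.pyplot as plt

# Empirical zero-one risk of f(X) = sign(X w) as a function of the first
# coordinate of w, with the second coordinate held fixed (toy data).
rng = np.random.RandomState(0)
X = rng.randn(50, 2)
y = np.sign(X.dot(np.array([1.0, -2.0])))

def zero_one_risk(w):
    return np.mean(np.sign(X.dot(w)) != y)

w1_grid = np.linspace(-3, 3, 500)
risks = [zero_one_risk(np.array([w1, -2.0])) for w1 in w1_grid]

plt.plot(w1_grid, risks)
plt.xlabel("$w_1$")
plt.ylabel("empirical zero-one risk")
plt.show()
# The curve is piecewise constant: moving w slightly almost never changes any
# prediction, so gradient-based optimization gets no useful signal.
```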

This function is discontinuous with large, flat regions and is thus extremely hard to optimize using gradient-based methods. For this reason it is usual to consider a proxy to the loss, called a surrogate loss function. For computational reasons this is usually a convex function $\Psi: \mathbb{R} \to \mathbb{R}_+$. An example of such a surrogate loss function is the hinge loss, $\Psi(t) = \max(1-t, 0)$, which is the loss used by Support Vector Machines (SVMs). Another example is the logistic loss, $\Psi(t) = \log(1 + \exp(-t))$, used by the logistic regression model. If we consider the logistic loss, minimizing the $\Psi$-risk, given by $\mathbb{E}_{X \times Y}[\Psi(Y f(X))]$, of the function $f(X) = X w$ becomes a much more tractable optimization problem.
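Here is a small self-contained sketch (again on toy data of my own) of what this buys us: the empirical logistic $\Psi$-risk of $f(X) = X w$ is smooth and convex in $w$, so plain gradient descent works.

```python
import numpy as np

def logistic(t):
    return np.log1p(np.exp(-t))

rng = np.random.RandomState(0)
X = rng.randn(50, 2)
y = np.sign(X.dot(np.array([1.0, -2.0])))

w = np.zeros(2)
for _ in range(200):
    margins = y * X.dot(w)
    # Gradient of the empirical logistic risk mean_i logistic(y_i <x_i, w>).
    grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 0.5 * grad

print("learned w:", w)
print("empirical logistic risk:", logistic(y * X.dot(w)).mean())
print("empirical zero-one risk:", np.mean(np.sign(X.dot(w)) != y))
```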

In short, we have replaced the $\ell$-risk, which is computationally difficult to optimize, with the $\Psi$-risk, which has more advantageous properties. A natural question to ask is how much we have lost by this change. The property that minimizing the $\Psi$-risk leads to a function that also minimizes the $\ell$-risk is often referred to as consistency or calibration. For a more formal definition see [1] and [2]. Whether it holds depends on the surrogate function $\Psi$: some choices of $\Psi$ satisfy the consistency property and others do not. One of the most useful characterizations was given in [1] and states that if $\Psi$ is convex then it is consistent if and only if it is differentiable at zero and $\Psi'(0) < 0$. This covers most of the commonly used surrogate loss functions, including the hinge, logistic and Huber loss functions.
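The condition is easy to check numerically. The snippet below approximates the one-sided derivatives of a few common surrogates at zero; the finite-difference check and the "perceptron" counterexample are mine, for illustration only.

```python
import numpy as np

# For a convex surrogate, the characterization above reduces consistency to:
# Psi is differentiable at 0 and Psi'(0) < 0. One-sided derivatives at 0 are
# approximated with finite differences.
losses = {
    "hinge":    lambda t: np.maximum(1 - t, 0),
    "logistic": lambda t: np.log1p(np.exp(-t)),
    "squared":  lambda t: (1 - t) ** 2,
    "perceptron (not consistent)": lambda t: np.maximum(-t, 0),
}

eps = 1e-6
for name, psi in losses.items():
    right = (psi(eps) - psi(0.0)) / eps
    left = (psi(0.0) - psi(-eps)) / eps
    differentiable = np.isclose(left, right)
    print(f"{name:30s} Psi'(0-)={left:+.2f}  Psi'(0+)={right:+.2f}  "
          f"calibrated: {differentiable and right < 0}")
```

Note that the hinge loss passes the test: its kink is at $t = 1$, not at $t = 0$, so it is differentiable at zero with derivative $-1$.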

  1. P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, Classification, and Risk Bounds,” J. Am. Stat. Assoc., pp. 1–36, 2003. 

  2. A. Tewari and P. L. Bartlett, “On the Consistency of Multiclass Classification Methods,” J. Mach. Learn. Res., vol. 8, pp. 1007–1025, 2007.