# fa.bianp.net

TL;DR: I describe a method for hyperparameter optimization by gradient descent.

Most machine learning models rely on at least one hyperparameter to control for model complexity. For example, logistic regression commonly relies on a regularization parameter that controls the amount of $\ell_2$ regularization. Similarly, kernel methods also have hyperparameters that control for properties of the kernel, such as the "width" parameter in the RBF kernel. The fundamental distinction between model parameters and hyperparameters is that, while model parameters are estimated by minimizing a goodness of fit with the training data, hyperparameters need to be estimated by other means (such as a cross-validation loss), as otherwise models with excessive would be selected, a phenomenon known as overfitting. We can use an approximate gradient to optimize a cross-validation loss with respect to hyperparameters. A decreasing bound between the true gradient and the approximate gradient ensures that the method converges towards a local minima.

Fitting hyperparameters is essential to obtain models with good accuracy and computationally challenging. The existing most popular methods for fitting hyperparameters are based on either exhaustively exploring the whole hyperparameter space (grid search and random search) or on Bayesian optimization techniques that use previous function evaluations to guide the optimization procedure. The starting point of this work was a simple question: why are the procedures to estimate parameters and hyperparameters so different? Is it possible to use known and reliable methods such as gradient descent to fit not only parameters, but also hyperparameters?

Interestingly, I found out that this question had been answered a long time ago. Already in the 90s, Larsen et al. devised a method (described here and here) using gradient-descent to estimate the optimal value of $\ell_2$ regularization for neural networks. Shortly after, Y. Bengio also published a paper on this topic. Recently, there has been a renewed interest in gradient-based methods (see for example this paper by Maclaurin or a slightly earlier work by Justin Domke, and references therein).

One of the drawbacks of gradient-based optimization of hyperparameters, is that these depend on quantities that are costly to compute such as the exact value of the model parameters and the inverse of a Hessian matrix. The aim of this work is to relax some of these assumptions and provide a method that works when the quantities involved (such as model parameters) are known only approximately. In practice, what this means is that hyperparameters can be updated before model parameters have fully converged, which results in big computational gains. For more details and experiments, please take a look at the paper.

This paper was presented at the International Conference on Machine Learning (ICML 2016).Code is now available in github and these are the slides I used for the occasion:

### Reviews

The original ICML reviews for this paper can be seen here and the rebuttal here. These were high quality reviews that had some rightful concenrs with the first version of the manuscript. In fact, 2 out of 3 reviewers gave a "weak reject" in this first phase. Luckily these concerns could be contested in the rebuttal (the final manuscript was updated accordingly) and the 2 reviewers that gave a "weak reject" changed their rating to "weak accept" and the paper was finally accepted.

• The outer loss function does not depend directly on the regularization parameter $\lambda$. Why is it there?

When the outer loss is a cross-validation loss this is true, but other criteria might depend on this parameter, such as the SURE criteria (see e.g. Deladalle et al. 2014).

### Citing

Please cite this work if the paper or its associated code are relevant for you. You can use the following bibtex:

  @inproceedings{PedregosaHyperparameter16,
author    = {Fabian Pedregosa},
title     = {Hyperparameter optimization with approximate gradient},
booktitle = {Proceedings of the 33nd International Conference on Machine Learning ({ICML})},
year      = {2016},
url       = {http://jmlr.org/proceedings/papers/v48/pedregosa16.html},
}


### Erratum

An early version of the paper contained the following typo: the first equation in Theorem 2 should read $\sum_{i=1}^\infty \varepsilon_i < \infty$ instead of $\sum_{i=1}^\infty \varepsilon_i \leq \infty$ (note the $<$ in the first equation versus $\leq$ in the second one). This typo has been corrected in both ArXiv version and the proceedings.

Announce: first public release of lightning!, a library for large-scale linear classification, regression and ranking in Python. The library was started a couple of years ago by Mathieu Blondel who also contributed the vast majority of source code. I joined recently its development and decided it was about time for a v0.1!.

Prebuild conda packages are available for all operating systems (god thank appveyor). More information on lightning's website.

Together with other scikit-learn developers we've created an umbrella organization for scikit-learn-related projects named scikit-learn-contrib. The idea is for this organization to host projects that are deemed too specific or too experimental to be included in the scikit-learn codebase but still offer an API which is compatible with scikit-learn and would like to benefit of the visibility of being labeled as scikit-learn-compatible.

We've set two requirements to being under this umbrella right now (this might evolve in the future). The first requirement is to have a scikit-learn compatible API, i.e., to follow the guide on the scikit-learn documentation so that objects can be used by scikit-learn meta-estimators (such as GridSearchCV). The second condition is that projects should be actively maintaned and have a high-quality codebase. Judging the quality of a codebase is difficult and subjective, but we agreed that at the bare minimum, the source code should be tested using continuous integration tools such as travis and reach a good test coverage (above 80%). More information is not available on the scikit-learn-contrib repository.

The first project to be hosted by this organization is lightning, but we hope that others will follow. If you would like to submit a new project, open an issue at the main project and we will look into it. There is also a project template for new and old projects.

Recently I've implemented, together with Arnaud Rachez, the SAGA[1] algorithm in the lightning machine learning library (which by the way, has been recently moved to the new scikit-learn-contrib project). The lightning library uses the same API as scikit-learn but is particularly adapted to online learning. As for the SAGA algorithm, its performance is similar to other variance-reduced stochastic algorithms such as SAG[3] or SVRG[2] but it has the advantage with respect to SAG[3] that it allows non-smooth penalty terms (such as $\ell_1$ regularization). It is implemented in lightning as SAGAClassifier and SAGARegressor.

We have taken care to make this implementation as efficient as possible. As for most stochastic gradient algorithms, a naive implementation takes 3 lines of code and is straightforward to implement. However, there are many tricks that are time-consuming and error-prone to implement but make a huge difference in efficiency.

A small example, more as a sanity check than to claim anything. The following plot shows the suboptimality as a function of time for three similar methods: SAG, SAGA and SVRG. The dataset used in the RCV1 dataset (test set, obtained from the libsvm webpage), consisting of 677.399 samples and 47.236 features. Interestingly, all methods can solve this rather large-scale problem within a few seconds. Within them, SAG and SAGA have a very similar performance and SVRG seems to be reasonably faster.

A note about the benchmarks: it is difficult to compare fairly stochastic gradient methods because at the end it usually boils down to how you choose the step size. In this plot I set the step size of all methods to 1/(3L) , where L is the Lipschitz constant of the objective function, as I think this is a popular choice. I would have prefered 1/L but SVRG was not converging for this step size. The code for the benchmarks can be found here.

1. A. Defazio, F. Bach & S. Lacoste-Julien. "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives" (2014).

2. Rie Johnson and Tong Zhang. "Accelerating stochastic gradient descent using predictive variance reduction." Advances in Neural Information Processing Systems. 2013.

3. Mark Schmidt, Nicolas Le Roux, and Francis Bach. "Minimizing finite sums with the stochastic average gradient." arXiv preprint arXiv:1309.2388 (2013).

My latests work (with Francis Bach and Alexandre Gramfort) is on the consistency of ordinal regression methods. It has the wildly imaginative title of "On the Consistency of Ordinal Regression Methods" and is currently under review but you can read the draft of it on ArXiv. If you have any thoughts about it, please leave me a comment!

Update July 2017: this paper was published on the Journal of Machine Learning Research. The published version can be found here

### Ordinal what?

The problem of ordinal regression is an old one in supervised learning. Its roots can be traced back to the works of McCullagh[1] in the 80s. It is a supervised learning problem that shares properties --yet is fundamentally different-- with both multiclass classification and regression. It can be seen as the problem of predicting a target variable from labeled observations, where the target label consists of discrete and ordered labels. As in the multiclass classification setting, the target variables are of discrete nature, and as in the regression setting (but unlike the multiclass setting) there is a meaningful order between the classes.

The most popular example of ordinal regression arise when the target variable is a human generated rating. For example, for a movie recommendation system, the target variable can have the possible values “do-not-bother” ≺ “only-if-you-must” ≺ “good” ≺ “verygood” ≺ “run-to-see”. Using multiclass classification to predict this target would yield a suboptimal classifier since it ignores the fact that there is a natural ordering between the labels. On the other hand, a regression algorithm assumes that the target variable is continuous, while here it is clearly discrete. Ordinal regression would be the ideal model for this target variable.

### Fisher consistency

The notion of Fisher consistency is also an old one in statistics, and goes back to the work of Fisher at the beginning of the 20th century. The rigorous definition is stated in the paper, so here I'll just give an intuition.

In supervised learning, we observe random samples (in the form of pairs (target, sample) usually denoted $(y_i, X_i)$ ) from a population (lets call it P) and build a model that predicts the target when he seems a new sample. Fisher consistency can be seen as a sanity check on the learning model, that states that if instead of seeing a random sample we would have access to the full population P (which in real life never happens), then our classifier would have an accuracy as good as the best possible accuracy (such classifier is usually called Bayes rule or Bayes predictor).

Having Fisher consistency is an important property that "allows us to design good loss functions with desirable properties"[7]. Because of this, in the last decade the Fisher consistency of most used supervised learning methods has been investigated. It has been shown (see e.g.[2]) that most common methods for binary classification are consistent. For the multiclass case and ranking, the situation is more interesting, with some methods that are known to be inconsistent, such as one-vs-all SVM in multiclass classification and RankSVM.

The study of Fisher consistency for ordinal regression methods has been done for the first time (to the best of my knowledge) here and proves that despite the negative results of multiclass classification[3][4] and ranking[5][6], common ordinal regression methods are Fisher consistent. This brings ordinal regression closer to binary classification than to multiclass classification in this respect. And in fact, some results in the paper can be seen as generalization of known results for binary classification.

### Highlights

In the paper we study the Fisher consistency of some popular ordinal regression methods. The methods that we analyze are the following (see Table 1 in the paper of a definition): all threshold, cumulative link, immediate threshold and last absolute deviation. In the paper we present the following results

• We characterize the consistency of all threshold method and immediate threshold by the derivative at zero of an auxiliary function.
• We provide an excess risk bound of the all threshold loss.
• Prove consistency of the least absolute deviation. This was already done for the case of three classes by Ramaswamy et al.[8], here we extend the proof to an arbitrary number of classes.
• Prove consistency of cumulative link model (a model that includes the venerable proportional odds model of McCullagh1).
• The consistency analysis suggest a novel loss function when optimizing a least squares metric. We test this novel model on different datasets and report that it performs very competitively.

1. McCullagh, Peter. "Regression models for ordinal data." Journal of the royal statistical society. Series B (Methodological) (1980).

2. Bartlett, Peter L., Michael I. Jordan, and Jon D. McAuliffe. "Convexity, classification, and risk bounds." Journal of the American Statistical Association (2006).

3. Tewari, Ambuj, and Peter L. Bartlett. "On the consistency of multiclass classification methods." The Journal of Machine Learning Research 8 (2007).

4. Zhang, Tong. "Statistical analysis of some multi-category large margin classification methods." The Journal of Machine Learning Research 5 (2004).

5. Duchi, John C., Lester W. Mackey, and Michael I. Jordan. "On the consistency of ranking algorithms." Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010.

6. Calauzenes, Clément, Nicolas Usunier, and Patrick Gallinari. "On the (non-) existence of convex, calibrated surrogate losses for ranking." Neural Information Processing Systems. 2012.

7. Zhang, Tong. "Statistical behavior and consistency of classification methods based on convex risk minimization." Annals of Statistics (2004)

8. Ramaswamy, Harish G., and Shivani Agarwal. "Classification calibration dimension for general multiclass losses." Advances in Neural Information Processing Systems. 2012.

Cross-validation iterators in scikit-learn are simply generator objects, that is, Python objects that implement the __iter__ method and that for each call to this method return (or more precisely, yield) the indices or a boolean mask for the train and test set. Hence, implementing new cross-validation iterators that behave as the ones in scikit-learn is easy with this in mind. Here goes a small code snippet that implements a holdout cross-validator generator following the scikit-learn API.

import numpy as np
from sklearn.utils import check_random_state

class HoldOut:
"""
Hold-out cross-validator generator. In the hold-out, the
data is split only once into a train set and a test set.
Unlike in other cross-validation schemes, the hold-out
consists of only one iteration.

Parameters
----------
n : total number of samples
test_size : 0 < float < 1
Fraction of samples to use as test set. Must be a
number between 0 and 1.
random_state : int
Seed for the random number generator.
"""
def __init__(self, n, test_size=0.2, random_state=0):
self.n = n
self.test_size = test_size
self.random_state = random_state

def __iter__(self):
n_test = int(np.ceil(self.test_size * self.n))
n_train = self.n - n_test
rng = check_random_state(self.random_state)
permutation = rng.permutation(self.n)
ind_test = permutation[:n_test]
ind_train = permutation[n_test:n_test + n_train]
yield ind_train, ind_test


Contrary to other cross-validation schemes, holdout relies on a single split of the data. It is well known than in practice holdout performs much worse than KFold or LeaveOneOut schemes. However, holdout has the advantage that its theoretical properties are easier to derive. For examples of this see e.g. Section 8.7 of Theory of classification: a survey of some recent advances and the very recent The reusable holdout.

TL; DR These are some notes on calibration of surrogate loss functions in the context of machine learning. But mostly it is an excuse to post some images I made.

In the binary-class classification setting we are given $n$ training samples $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where $X_i$ belongs to some sample space $\mathcal{X}$, usually $\mathbb{R}^p$ but for the purpose of this post we can keep i abstract, and $y_i \in \{-1, 1\}$ is an integer representing the class label.

We are also given a loss function $\ell: \{-1, 1\} \times \{-1, 1\} \to \mathbb{R}$ that measures the error of a given prediction. The value of the loss function $\ell$ at an arbitrary point $(y, \hat{y})$ is interpreted as the cost incurred by predicting $\hat{y}$ when the true label is $y$. In classification this function is often the zero-one loss, that is, $\ell(y, \hat{y})$ is zero when $y = \hat{y}$ and one otherwise.

The goal is to find a function $h: \mathcal{X} \to [k]$, the classifier, with the smallest expected loss on a new sample. In other words, we seek to find a function $h$ that minimizes the expected $\ell$-risk, given by $$\mathcal{R}_{\ell}(h) = \mathbb{E}_{X \times Y}[\ell(Y, h(X))]$$

In theory, we could directly minimize the $\ell$-risk and we would have the optimal classifier, also known as Bayes predictor. However, there are several problems associated with this approach. One is that the probability distribution of $X \times Y$ is unknown, thus computing the exact expected value is not feasible. It must be approximated by the empirical risk. Another issue is that this quantity is difficult to optimize because the function $\ell$ is discontinuous. Take for example a problem in which $\mathcal{X} = \mathbb{R}^2, k=2$, and we seek to find the linear function $f(X) = \text{sign}(X w), w \in \mathbb{R}^2$ and that minimizes the $\ell$-risk. As a function of the parameter $w$ this function looks something like

This function is discontinuous with large, flat regions and is thus extremely hard to optimize using gradient-based methods. For this reason it is usual to consider a proxy to the loss called a surrogate loss function. For computational reasons this is usually convex function $\Psi: \mathbb{R} \to \mathbb{R}_+$. An example of such surrogate loss functions is the hinge loss, $\Psi(t) = \max(1-t, 0)$, which is the loss used by Support Vector Machines (SVMs). Another example is the logistic loss, $\Psi(t) = 1/(1 + \exp(-t))$, used by the logistic regression model. If we consider the logistic loss, minimizing the $\Psi$-risk, given by $\mathbb{E}_{X \times Y}[\Psi(Y, f(X))]$, of the function $f(X) = X w$ becomes a much more more tractable optimization problem:

In short, we have replaced the $\ell$-risk which is computationally difficult to optimize with the $\Psi$-risk which has more advantageous properties. A natural questions to ask is how much have we lost by this change. The property of whether minimizing the $\Psi$-risk leads to a function that also minimizes the $\ell$-risk is often referred to as consistency or calibration. For a more formal definition see [1] and [2]. This property will depend on the surrogate function $\Psi$: for some functions $\Psi$ it will be verified the consistency property and for some not. One of the most useful characterizations was given in [1] and states that if $\Psi$ is convex then it is consistent if and only if it is differentiable at zero and $\Psi'(0) < 0$. This includes most of the commonly used surrogate loss functions, including hinge, logistic regression and Huber loss functions.

1. P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity , Classification , and Risk Bounds,” J. Am. Stat. Assoc., pp. 1–36, 2003.

2. A. Tewari and P. L. Bartlett, “On the Consistency of Multiclass Classification Methods,” J. Mach. Learn. Res., vol. 8, pp. 1007–1025, 2007.

In this post I compar several implementations of Logistic Regression. The task was to implement a Logistic Regression model using standard optimization tools from scipy.optimize and compare them against state of the art implementations such as LIBLINEAR.

In this blog post I'll write down all the implementation details of this model, in the hope that not only the conclusions but also the process would be useful for future comparisons and benchmarks.

## Function evaluation

We consider the case in which the decision function is an affine function, i.e., $f(x) = \langle x, w \rangle + c$ where $w$ and $c$ are parameters to estimate. The loss function for the $\ell_2$-regularized logistic regression, i.e. the function to be minimized is

$$\mathcal{L}(w, \lambda, X, y) = - \frac{1}{n}\sum_{i=1}^n \log(\phi(y_i (\langle X_i, w \rangle + c))) + \frac{\lambda}{2} w^T w$$

where $\phi(t) = 1. / (1 + \exp(-t))$ is the logistic function, $\lambda w^T w$ is the regularization term and $X, y$ is the input data, with $X \in \mathbb{R}^{n \times p}$ and $y \in \{-1, 1\}^n$. However, this formulation is not great from a practical standpoint. Even for not unlikely values of $t$ such as $t = -100$, $\exp(100)$ will overflow, assigning the loss an (erroneous) value of $+\infty$. For this reason 1, we evaluate $\log(\phi(t))$ as

$$\log(\phi(t)) = \begin{cases} - \log(1 + \exp(-t)) \text{ if } t > 0 \\ t - \log(1 + \exp(t)) \text{ if } t \leq 0\\ \end{cases}$$

The gradient of the loss function is given by

\begin{aligned} \nabla_w \mathcal{L} &= \frac{1}{n}\sum_{i=1}^n y_i X_i (\phi(y_i (\langle X_i, w \rangle + c)) - 1) + \lambda w \\ \nabla_c \mathcal{L} &= \sum_{i=1}^n y_i (\phi(y_i (\langle X_i, w \rangle + c)) - 1) \end{aligned}

Similarly, the logistic function $\phi$ used here can be computed in a more stable way using the formula

$$\phi(t) = \begin{cases} 1 / (1 + \exp(-t)) \text{ if } t > 0 \\ \exp(t) / (1 + \exp(t)) \text{ if } t \leq 0\\ \end{cases}$$

Finally, we will also need the Hessian for some second-order methods, which is given by

$$\nabla_w ^2 \mathcal{L} = X^T D X + \lambda I$$

where $I$ is the identity matrix and $D$ is a diagonal matrix given by $D_{ii} = \phi(y_i w^T X_i)(1 - \phi(y_i w^T X_i))$.

In Python, these function can be written as

import numpy as np

def phi(t):
# logistic function, returns 1 / (1 + exp(-t))
idx = t > 0
out = np.empty(t.size, dtype=np.float)
out[idx] = 1. / (1 + np.exp(-t[idx]))
exp_t = np.exp(t[~idx])
out[~idx] = exp_t / (1. + exp_t)
return out

def loss(x0, X, y, alpha):
# logistic loss function, returns Sum{-log(phi(t))}
w, c = x0[:X.shape[1]], x0[-1]
z = X.dot(w) + c
yz = y * z
idx = yz > 0
out = np.zeros_like(yz)
out[idx] = np.log(1 + np.exp(-yz[idx]))
out[~idx] = (-yz[~idx] + np.log(1 + np.exp(yz[~idx])))
out = out.sum() / X.shape[0] + .5 * alpha * w.dot(w)
return out

# gradient of the logistic loss
w, c = x0[:X.shape[1]], x0[-1]
z = X.dot(w) + c
z = phi(y * z)
z0 = (z - 1) * y
grad_w = X.T.dot(z0) / X.shape[0] + alpha * w


## Benchmarks

I tried several methods to estimate this $\ell_2$-regularized logistic regression. There is one first-order method (that is, it only makes use of the gradient and not of the Hessian), Conjugate Gradient whereas all the others are Quasi-Newton methods. The method I tested are:

• CG = Conjugate Gradient as implemented in scipy.optimize.fmin_cg
• TNC = Truncated Newton as implemented in scipy.optimize.fmin_tnc
• BFGS = Broyden–Fletcher–Goldfarb–Shanno method, as implemented in scipy.optimize.fmin_bfgs.
• L-BFGS = Limited-memory BFGS as implemented in scipy.optimize.fmin_l_bfgs_b. Contrary to the BFGS algorithm, which is written in Python, this one wraps a C implementation.
• Trust Region = Trust Region Newton method 1. This is the solver used by LIBLINEAR that I've wrapped to accept any Python function in the package pytron

To assure the most accurate results across implementations, all timings were collected by callback functions that were called from the algorithm on each iteration. Finally, I plot the maximum absolute value of the gradient (=the infinity norm of the gradient) with respect to time.

The synthetic data used in the benchmarks was generated as described in 2 and consists primarily of the design matrix $X$ being Gaussian noise, the vector of coefficients is drawn also from a Gaussian distribution and the explained variable $y$ is generated as $y = \text{sign}(X w)$. We then perturb matrix $X$ by adding Gaussian noise with covariance 0.8. The number of samples and features was fixed to $10^4$ and $10^3$ respectively. The penalization parameter $\lambda$ was fixed to 1.

In this setting variables are typically uncorrelated and most solvers perform decently:

Here, the Trust Region and L-BFGS solver perform almost equally good, with Conjugate Gradient and Truncated Newton falling shortly behind. I was surprised by the difference between BFGS and L-BFGS, I would have thought that when memory was not an issue both algorithms should perform similarly.

To make things more interesting, we now make the design to be slightly more correlated. We do so by adding a constant term of 1 to the matrix $X$ and adding also a column vector of ones this matrix to account for the intercept. These are the results:

Here, we already see that second-order methods dominate over first-order methods (well, except for BFGS), with Trust Region clearly dominating the picture but with TNC not far behind.

Finally, if we force the matrix to be even more correlated (we add 10. to the design matrix $X$), then we have:

Here, the Trust-Region method has the same timing as before, but all other methods have got substantially worse.The Trust Region method, unlike the other methods is surprisingly robust to correlated designs.

To sum up, the Trust Region method performs extremely well for optimizing the Logistic Regression model under different conditionings of the design matrix. The LIBLINEAR software uses this solver and thus has similar performance, with the sole exception that the evaluation of the logistic function and its derivatives is done in C++ instead of Python. In practice, however, due to the small number of iterations of this solver I haven't seen any significant difference.

1. A similar development can be found in the source code of LIBLINEAR, and is probably also used elsewhere.

2. "A comparison of numerical optimizers for logistic regression", P. Minka, URL

3. "Newton's Method for Large Bound-Constrained Optimization Problems", Chih-Jen Lin, Jorge J. More URL

TL;DR: I've implemented a logistic ordinal regression or proportional odds model. Here is the Python code

The logistic ordinal regression model, also known as the proportional odds was introduced in the early 80s by McCullagh [1, 2] and is a generalized linear model specially tailored for the case of predicting ordinal variables, that is, variables that are discrete (as in classification) but which can be ordered (as in regression). It can be seen as an extension of the logistic regression model to the ordinal setting.

Given $X \in \mathbb{R}^{n \times p}$ input data and $y \in \mathbb{N}^n$ target values. For simplicity we assume $y$ is a non-decreasing vector, that is, $y_1 \leq y_2 \leq ...$. Just as the logistic regression models posterior probability $P(y=j|X_i)$ as the logistic function, in the logistic ordinal regression we model the cummulative probability as the logistic function. That is,

$$P(y \leq j|X_i) = \phi(\theta_j - w^T X_i) = \frac{1}{1 + \exp(w^T X_i - \theta_j)}$$

where $w, \theta$ are vectors to be estimated from the data and $\phi$ is the logistic function defined as $\phi(t) = 1 / (1 + \exp(-t))$.

Compared to multiclass logistic regression, we have added the constrain that the hyperplanes that separate the different classes are parallel for all classes, that is, the vector $w$ is common across classes. To decide to which class will $X_i$ be predicted we make use of the vector of thresholds $\theta$. If there are $K$ different classes, $\theta$ is a non-decreasing vector (that is, $\theta_1 \leq \theta_2 \leq ... \leq \theta_{K-1}$) of size $K-1$. We will then assign the class $j$ if the prediction $w^T X$ (recall that it's a linear model) lies in the interval $[\theta_{j-1}, \theta_{j}[$. In order to keep the same definition for extremal classes, we define $\theta_{0} = - \infty$ and $\theta_K = + \infty$.

The intuition is that we are seeking a vector $w$ such that $X w$ produces a set of values that are well separated into the different classes by the different thresholds $\theta$. We choose a logistic function to model the probability $P(y \leq j|X_i)$ but other choices are possible. In the proportional hazards model 1 the probability is modeled as $-\log(1 - P(y \leq j | X_i)) = \exp(\theta_j - w^T X_i)$. Other link functions are possible, where the link function satisfies $\text{link}(P(y \leq j | X_i)) = \theta_j - w^T X_i$. Under this framework, the logistic ordinal regression model has a logistic link function and the proportional hazards model has a log-log link function.

The logistic ordinal regression model is also known as the proportional odds model, because the ratio of corresponding odds for two different samples $X_1$ and $X_2$ is $\exp(w^T(X_1 - X_2))$ and so does not depend on the class $j$ but only on the difference between the samples $X_1$ and $X_2$.

### Optimization

Model estimation can be posed as an optimization problem. Here, we minimize the loss function for the model, defined as minus the log-likelihood:

$$\mathcal{L}(w, \theta) = - \sum_{i=1}^n \log(\phi(\theta_{y_i} - w^T X_i) - \phi(\theta_{y_i -1} - w^T X_i))$$

In this sum all terms are convex on $w$, thus the loss function is convex over $w$. It might be also jointly convex over $w$ and $\theta$, although I haven't checked. I use the function fmin_slsqp in scipy.optimize to optimize $\mathcal{L}$ under the constraint that $\theta$ is a non-decreasing vector. There might be better options, I don't know. If you do know, please leave a comment!.

Using the formula $\log(\phi(t))^\prime = (1 - \phi(t))$, we can compute the gradient of the loss function as

\begin{align} \nabla_w \mathcal{L}(w, \theta) &= \sum_{i=1}^n X_i (1 - \phi(\theta_{y_i} - w^T X_i) - \phi(\theta_{y_i-1} - w^T X_i)) \\ % \nabla_\theta \mathcal{L}(w, \theta) &= \sum_{i=1}^n - \frac{e_{y_i} \exp(\theta_{y_i}) - e_{y_i -1} \exp(\theta_{y_i -1})}{\exp(\theta_{y_i}) - \exp(\theta_{y_i-1})} \\ \nabla_\theta \mathcal{L}(w, \theta) &= \sum_{i=1}^n e_{y_i} \left(1 - \phi(\theta_{y_i} - w^T X_i) - \frac{1}{1 - \exp(\theta_{y_i -1} - \theta_{y_i})}\right) \\ & \qquad + e_{y_i -1}\left(1 - \phi(\theta_{y_i -1} - w^T X_i) - \frac{1}{1 - \exp(- (\theta_{y_i-1} - \theta_{y_i}))}\right) \end{align}

where $e_i$ is the $i$th canonical vector.

### Code

I've implemented a Python version of this algorithm using Scipy's optimize.fmin_slsqp function. This takes as arguments the loss function, the gradient denoted before and a function that is > 0 when the inequalities on $\theta$ are satisfied.

Code can be found here as part of the minirank package, which is my sandbox for code related to ranking and ordinal regression. At some point I would like to submit it to scikit-learn but right now the I don't know how the code will scale to medium-scale problems, but I suspect not great. On top of that I'm not sure if there is a real demand of these models for scikit-learn and I don't want to bloat the package with unused features.

### Performance

I compared the prediction accuracy of this model in the sense of mean absolute error (IPython notebook) on the boston house-prices dataset. To have an ordinal variable, I rounded the values to the closest integer, which gave me a problem of size 506 $\times$ 13 with 46 different target values. Although not a huge increase in accuracy, this model did give me better results on this particular dataset:

Here, ordinal logistic regression is the best-performing model, followed by a Linear Regression model and a One-versus-All Logistic regression model as implemented in scikit-learn.

1. "Regression models for ordinal data", P. McCullagh, Journal of the royal statistical society. Series B (Methodological), 1980

2. "Generalized Linear Models", P. McCullagh and J. A. Nelder (Book)

3. "Loss Functions for Preference Levels : Regression with Discrete Ordered Labels", Jason D. M. Rennie, Nathan Srebro

My latest contribution for scikit-learn is an implementation of the isotonic regression model that I coded with Nelle Varoquaux and Alexandre Gramfort. This model finds the best least squares fit to a set of points, given the constraint that the fit must be a non-decreasing function. The example on the scikit-learn website gives an intuition on this model.

The original points are in red, and the estimated ones are in green. As you can see, there is one estimation (green point) for each data sample (red point). Calling $y \in \mathbb{R}^n$ the input data, the model can be written concisely as an optimization problem over $x$

$$\text{argmin}_x |y - x |^2 \\ \text{subject to } x_0 \leq x_1 \leq \cdots \leq x_n$$

The algorithm implemented in scikit-learn 3 is the pool adjacent violators algorithm 1, which is an efficient linear time $\mathcal{O}(n)$ algorithm. The algorithm sweeps through the data looking for violations of the monotonicity constraint. When it finds one, it adjusts the estimate to the best possible fit with constraints. Sometimes it also needs to modify previous points to make sure the new estimate does not violate the constraints. The following picture shows how it proceeds at each iteration

1. "Active set algorithms for isotonic regression; A unifying framework", Michael J. Best, Nilotpal Chakravarti

2. Python notebook to generate the figures: ipynb and web version

3. The algorithm is used through the sklearn.isotonic.IsotonicRegression object (doc) or the function sklearn.isotonic.isotonic_regression (doc