TL;DR: I've implemented a logistic ordinal regression (or
proportional odds) model in Python.
The logistic ordinal regression model, also known as the
proportional odds model, was introduced in the early 80s by McCullagh [1, 2]
and is a generalized linear model specially tailored for the case of
predicting ordinal variables, that is, variables that are discrete (as
in classification) but which can be ordered (as in regression). It can
be seen as an extension of the logistic regression model to the
ordinal setting.
Let $X \in \mathbb{R}^{n \times p}$ be the input data and $y \in
\mathbb{N}^n$ the target values. For simplicity we assume $y$ is a
non-decreasing vector, that is, $y_1 \leq y_2 \leq ...$. Just as
logistic regression models the posterior probability $P(y=j|X_i)$ with the
logistic function, in logistic ordinal regression we model the
cumulative probability with the logistic function. That is,
$$P(y \leq j | X_i) = \phi(\theta_j - w^T X_i) = \frac{1}{1 + \exp(w^T X_i - \theta_j)}$$
where $w, \theta$ are vectors to be estimated from the data and $\phi$
is the logistic function defined as $\phi(t) = 1 / (1 + \exp(-t))$.
Compared to multiclass logistic regression, we have added the
constraint that the hyperplanes that separate the different classes are
parallel for all classes, that is, the vector $w$ is common across
classes. To decide which class is predicted for $X_i$ we make use
of the vector of thresholds $\theta$. If there are $K$ different
classes, $\theta$ is a non-decreasing vector (that is, $\theta_1 \leq
\theta_2 \leq ... \leq \theta_{K-1}$) of size $K-1$. We then
assign the class $j$ if the prediction $w^T X_i$ (recall that it's a
linear model) lies in the interval $[\theta_{j-1}, \theta_{j})$. In
order to keep the same definition for the extremal classes, we define
$\theta_{0} = - \infty$ and $\theta_K = + \infty$.
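The prediction rule above can be sketched in a few lines of NumPy. This is only an illustration under assumed values: the helper names `phi`, `cumulative_proba` and `predict`, as well as the numbers chosen for `w` and `theta`, are hypothetical and not taken from the minirank code.

```python
import numpy as np

def phi(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def cumulative_proba(X, w, theta):
    """P(y <= j | X_i) for j = 1..K-1, one row per sample."""
    scores = X @ w                                  # shape (n,)
    return phi(theta[None, :] - scores[:, None])    # shape (n, K-1)

def predict(X, w, theta):
    """Class j (1-based) such that theta_{j-1} <= w^T X_i < theta_j."""
    scores = X @ w
    # side='right' counts the thresholds <= score, so a score equal to
    # theta_{j-1} lands in class j, matching the half-open interval.
    return np.searchsorted(theta, scores, side='right') + 1

# made-up parameters: K = 4 classes, hence 3 thresholds
w = np.array([1.0, -0.5])
theta = np.array([-1.0, 0.5, 2.0])
X = np.array([[-2.0, 0.0], [0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])

classes = predict(X, w, theta)
P = cumulative_proba(X, w, theta)
print(classes)
print(P.round(3))
```

Each row of `P` is non-decreasing in $j$ because $\theta$ is non-decreasing, which is what makes the differences of consecutive columns valid class probabilities.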
The intuition is that we are seeking a vector $w$ such that $X w$
produces a set of values that are well separated into the different
classes by the different thresholds $\theta$. We choose a logistic
function to model the probability $P(y \leq j|X_i)$ but other choices
are possible. In the proportional hazards model [1] the probability
is modeled as $-\log(1 - P(y \leq j | X_i)) = \exp(\theta_j - w^T
X_i)$. More generally, other link functions can be used, where the link
function satisfies $\text{link}(P(y \leq j | X_i)) = \theta_j - w^T X_i$. Under
this framework, the logistic ordinal regression model has a logistic (logit)
link function and the proportional hazards model has a complementary
log-log link function, $\text{link}(P) = \log(-\log(1 - P))$.
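To make the role of the link function concrete, here is a small sketch comparing the two choices mentioned above. The function names are my own; the second link is derived directly from the proportional hazards formula $-\log(1 - P) = \exp(t)$ and is usually called the complementary log-log link.

```python
import numpy as np

def logit(p):
    """Logistic link: log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

def logit_inverse(t):
    """Inverse of the logistic link, i.e. the logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def cloglog(p):
    """Link implied by -log(1 - p) = exp(t), i.e. log(-log(1 - p))."""
    return np.log(-np.log(1.0 - p))

def cloglog_inverse(t):
    """Inverse of the link above: p = 1 - exp(-exp(t))."""
    return 1.0 - np.exp(-np.exp(t))

# t plays the role of theta_j - w^T X_i
t = np.linspace(-3, 3, 7)
p_logit = logit_inverse(t)
p_cloglog = cloglog_inverse(t)
print(np.column_stack((t, p_logit, p_cloglog)).round(4))
```

Both inverse links map the real line to $(0, 1)$ and are increasing, so either one yields valid cumulative probabilities. They differ in shape: the logistic one is symmetric around $t = 0$, the other is not.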
The logistic ordinal regression model is also known as the
proportional odds model, because the ratio of the corresponding odds
$P(y \leq j | X) / P(y > j | X)$
for two different samples $X_1$ and $X_2$ is $\exp(-w^T(X_1 - X_2))$,
which does not depend on the class $j$ but only on the difference between
the samples $X_1$ and $X_2$.
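The proportional odds property is easy to check numerically. In the sketch below (all values are made up), the odds are taken to be those of the event $y \leq j$, that is $P(y \leq j | X) / P(y > j | X)$, for which the ratio equals $\exp(-w^T(X_1 - X_2))$. Taking the odds of $y > j$ instead gives the reciprocal, $\exp(w^T(X_1 - X_2))$. Either way the ratio is the same for every threshold $j$.

```python
import numpy as np

def phi(t):
    return 1.0 / (1.0 + np.exp(-t))

def odds(p):
    return p / (1.0 - p)

# made-up model parameters and samples
w = np.array([0.8, -0.3])
theta = np.array([-1.0, 0.0, 1.5])
x1 = np.array([1.0, 2.0])
x2 = np.array([-0.5, 0.5])

# odds of the event y <= j, one entry per threshold j
odds_1 = odds(phi(theta - w @ x1))
odds_2 = odds(phi(theta - w @ x2))

ratio = odds_1 / odds_2
expected = np.exp(-w @ (x1 - x2))
print(ratio, expected)
```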
Optimization
Model estimation can be posed as an optimization problem. Here, we
minimize the loss function for the model, defined as minus the
log-likelihood:
$$\mathcal{L}(w, \theta) = - \sum_{i=1}^n \log\left(\phi(\theta_{y_i} - w^T X_i) - \phi(\theta_{y_i - 1} - w^T X_i)\right)$$
In this sum all terms are convex in $w$, thus the loss function is
convex in $w$. It might also be jointly convex in $w$ and
$\theta$, although I haven't checked. I use the function
fmin_slsqp in scipy.optimize to optimize
$\mathcal{L}$ under the constraint that $\theta$ is a non-decreasing
vector. There might be better options, I don't know. If you do know,
please leave a comment!
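As a rough illustration of this setup (not the minirank implementation), the sketch below minimizes the negative log-likelihood with `scipy.optimize.fmin_slsqp` on synthetic data. Several things are assumptions made for the example: classes are coded 0..K-1 rather than 1..K, the data come from a latent-variable model with logistic noise, the helper names are hypothetical, and the solver is left to approximate the gradient numerically.

```python
import numpy as np
from scipy import optimize

def phi(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(params, X, y):
    """Negative log-likelihood; params = (w, theta), y in 0..K-1."""
    p = X.shape[1]
    w, theta = params[:p], params[p:]
    # pad with theta_0 = -inf and theta_K = +inf
    theta_pad = np.concatenate(([-np.inf], theta, [np.inf]))
    s = X @ w
    prob = phi(theta_pad[y + 1] - s) - phi(theta_pad[y] - s)
    # clip so the log stays finite if the solver visits infeasible points
    return -np.sum(np.log(np.clip(prob, 1e-12, None)))

def theta_gaps(params, X, y):
    """Inequality constraints: each entry must be >= 0."""
    return np.diff(params[X.shape[1]:])

# synthetic ordinal data (assumed values)
rng = np.random.RandomState(0)
n, p, K = 300, 2, 4
X = rng.randn(n, p)
w_true = np.array([2.0, -1.0])
theta_true = np.array([-1.5, 0.0, 1.5])
latent = X @ w_true + rng.logistic(size=n)
y = np.searchsorted(theta_true, latent)   # labels 0..K-1

# start from w = 0 and increasing thresholds
x0 = np.concatenate((np.zeros(p), np.arange(K - 1, dtype=float)))
params_hat = optimize.fmin_slsqp(
    loss, x0, f_ieqcons=theta_gaps, args=(X, y), iter=200, iprint=0)
w_hat, theta_hat = params_hat[:p], params_hat[p:]
print(w_hat, theta_hat)
```

`fmin_slsqp` passes `args` to both the objective and the constraint function, which is why `theta_gaps` accepts `X` and `y` even though it ignores them. An analytic gradient can be supplied through the `fprime` argument.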
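Using the formula $\log(\phi(t))^\prime = 1 - \phi(t)$, we can compute the gradient of the loss function as
$$\begin{aligned}
\nabla_w \mathcal{L}(w, \theta) &= \sum_{i=1}^n X_i \left(1 - \phi(\theta_{y_i} - w^T X_i) - \phi(\theta_{y_i - 1} - w^T X_i)\right) \\
\nabla_\theta \mathcal{L}(w, \theta) &= \sum_{i=1}^n e_{y_i} \left(\phi(\theta_{y_i} - w^T X_i) - \frac{1}{1 - \exp(\theta_{y_i - 1} - \theta_{y_i})}\right) \\
&\qquad + e_{y_i - 1} \left(\phi(\theta_{y_i - 1} - w^T X_i) - \frac{1}{1 - \exp(\theta_{y_i} - \theta_{y_i - 1})}\right)
\end{aligned}$$
where $e_j$ is the $j$-th canonical basis vector of $\mathbb{R}^{K-1}$ (the terms in $e_0$ and $e_K$ are dropped, since $\theta_0$ and $\theta_K$ are fixed).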
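A gradient like this is easy to get wrong, so it is worth checking against finite differences. The sketch below is one way to do that, under assumptions: classes are coded 0..K-1, the data are random, and the function names are hypothetical. It writes each term of the loss as $-\log(\phi(a_i) - \phi(b_i))$ with $a_i = \theta_{y_i} - w^T X_i$ and $b_i = \theta_{y_i - 1} - w^T X_i$, whose derivatives with respect to the upper and lower threshold are $\phi(a_i) - 1/(1 - e^{b_i - a_i})$ and $\phi(b_i) - 1/(1 - e^{a_i - b_i})$.

```python
import numpy as np

def phi(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(params, X, y):
    """Negative log-likelihood; params = (w, theta), y in 0..K-1."""
    p = X.shape[1]
    w, theta = params[:p], params[p:]
    theta_pad = np.concatenate(([-np.inf], theta, [np.inf]))
    s = X @ w
    return -np.sum(np.log(phi(theta_pad[y + 1] - s) - phi(theta_pad[y] - s)))

def grad(params, X, y):
    """Analytic gradient of `loss` with respect to (w, theta)."""
    p = X.shape[1]
    w, theta = params[:p], params[p:]
    theta_pad = np.concatenate(([-np.inf], theta, [np.inf]))
    s = X @ w
    pa = phi(theta_pad[y + 1] - s)          # phi(a_i)
    pb = phi(theta_pad[y] - s)              # phi(b_i)
    grad_w = X.T @ (1.0 - pa - pb)
    gap = theta_pad[y + 1] - theta_pad[y]   # a_i - b_i > 0, possibly +inf
    d_upper = pa - 1.0 / (1.0 - np.exp(-gap))
    d_lower = pb - 1.0 / (1.0 - np.exp(gap))
    # accumulate per-sample derivatives on the padded threshold vector,
    # then drop the two fixed entries (-inf and +inf)
    grad_theta = np.zeros(theta_pad.size)
    np.add.at(grad_theta, y + 1, d_upper)
    np.add.at(grad_theta, y, d_lower)
    return np.concatenate((grad_w, grad_theta[1:-1]))

def numerical_grad(f, params, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(params)
    for k in range(params.size):
        step = np.zeros_like(params)
        step[k] = eps
        g[k] = (f(params + step) - f(params - step)) / (2 * eps)
    return g

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = rng.randint(0, 4, size=50)              # K = 4 classes
params = np.concatenate((rng.randn(3), [-1.0, 0.0, 1.0]))

g_analytic = grad(params, X, y)
g_numeric = numerical_grad(lambda q: loss(q, X, y), params)
max_err = np.max(np.abs(g_analytic - g_numeric))
print(max_err)
```

`np.add.at` is used instead of fancy-indexed `+=` because several samples share the same class, and repeated indices must accumulate.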
I've implemented a Python version of this algorithm using SciPy's
optimize.fmin_slsqp function. This takes as arguments the
loss function, its gradient and a function that is
non-negative when the inequalities on $\theta$ are satisfied.
The code can be found as part of the minirank package, which
is my sandbox for code related to ranking and ordinal regression. At
some point I would like to submit it to scikit-learn, but right now
I don't know how the code will scale to medium-scale problems, and I
suspect not well. On top of that, I'm not sure there is real demand
for these models in scikit-learn and I don't want to bloat the package
with unused features.
Performance
I compared the prediction accuracy of this model, in the sense of mean
absolute error, on the Boston house-prices dataset (the comparison is in an
IPython notebook). To have an ordinal variable, I
rounded the values to the closest integer, which gave me a problem of
size 506 $\times$ 13 with 46 different target values. Although the
increase in accuracy is not huge, this model did give me better results on
this particular dataset.
Ordinal logistic regression is the best-performing model,
followed by a linear regression model and a one-versus-all logistic
regression model as implemented in scikit-learn.
[1] "Regression models for ordinal data", P. McCullagh, Journal of
the Royal Statistical Society, Series B (Methodological), 1980.
[2] "Generalized Linear Models", P. McCullagh and J. A. Nelder (book).
[3] "Loss Functions for Preference Levels: Regression with Discrete
Ordered Labels", Jason D. M. Rennie, Nathan Srebro.
This tutorial introduces the concept of pairwise preference used in most ranking problems. I'll use scikit-learn for learning and matplotlib for visualization.
In the ranking setting, training data consists of lists of items with some order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item, so that for any two samples a and b, either a < b, a > b, or a and b are not comparable.
For example, in the case of a search engine, our dataset consists of results that belong to different queries and we would like to only compare the relevance for results coming from the same query.
This order relation is usually domain-specific. For instance, in information retrieval the set of comparable samples is referred to as a "query id". The goal behind this is to compare only documents that belong to the same query (Joachims 2002). In medical imaging, on the other hand, the order of the labels usually depends on the subject, so the sets of comparable samples are given by the different subjects in the study (Pedregosa et al 2012).
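As a small illustration of what "comparable" means in practice, the sketch below lists the index pairs that may be compared when each sample carries a group identifier such as a query id. The function name and the toy data are made up. Pairs with equal labels are dropped as well, on the assumption that they express no preference.

```python
import itertools

def comparable_pairs(y, groups):
    """Index pairs (i, j) that share a group and have different labels."""
    pairs = []
    for i, j in itertools.combinations(range(len(y)), 2):
        if groups[i] == groups[j] and y[i] != y[j]:
            pairs.append((i, j))
    return pairs

# toy data: three results for query q1, two for query q2
y = [2, 0, 1, 1, 0]
groups = ['q1', 'q1', 'q1', 'q2', 'q2']
pairs = comparable_pairs(y, groups)
print(pairs)  # [(0, 1), (0, 2), (1, 2), (3, 4)]
```

No pair mixes `q1` and `q2`, even though their labels live on the same scale.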
To start with, we'll create a dataset in which the target values consist of three graded measurements Y = {0, 1, 2} and the training data is a collection of 30 samples, each one with two features (another 30 samples are held out for testing).
The set of comparable elements (queries in information retrieval) will consist of two equally sized blocks, $X = X_1 \cup X_2$, where each block is generated using a normal distribution with different mean and covariance. In the pictures, we represent $X_1$ with round markers and $X_2$ with triangular markers.
    import numpy as np
    import matplotlib.pyplot as pl
    from sklearn.model_selection import StratifiedShuffleSplit

    np.random.seed(0)
    theta = np.deg2rad(60)
    w = np.array([np.sin(theta), np.cos(theta)])
    K = 20
    X = np.random.randn(K, 2)
    y = [0] * K
    for i in range(1, 3):
        X = np.concatenate((X, np.random.randn(K, 2) + i * 4 * w))
        y = np.concatenate((y, [i] * K))

    # slightly displace data corresponding to our second partition
    X[::2] -= np.array([3, 7])
    blocks = np.array([0, 1] * (X.shape[0] // 2))

    # split into train and test set
    cv = StratifiedShuffleSplit(n_splits=1, test_size=.5)
    train, test = next(cv.split(X, y))
    X_train, y_train, b_train = X[train], y[train], blocks[train]
    X_test, y_test, b_test = X[test], y[test], blocks[test]

    # plot the result
    idx = (b_train == 0)
    pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
               marker='^', cmap=pl.cm.Blues, s=100)
    pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
               marker='o', cmap=pl.cm.Blues, s=100)
    pl.arrow(0, 0, 8 * w[0], 8 * w[1], fc='gray', ec='gray',
             head_width=0.5, head_length=0.5)
    pl.text(0, 1, '$w$', fontsize=20)
    pl.arrow(-3, -8, 8 * w[0], 8 * w[1], fc='gray', ec='gray',
             head_width=0.5, head_length=0.5)
    pl.text(-2.6, -7, '$w$', fontsize=20)
    pl.axis('equal')
    pl.show()
In the plot we clearly see that for both blocks there's a common vector w such that the projection onto w gives a list with the correct ordering.
However, because linear regression considers that output labels live in a metric space, it will treat all pairs as comparable. Thus, if we fit this model to the problem above, it will fit both blocks at the same time, yielding a result that is clearly not optimal. In the following plot we estimate $\hat{w}$ using an l2-regularized linear model.
    from scipy import linalg
    from sklearn import linear_model

    ridge = linear_model.Ridge(alpha=1.)
    ridge.fit(X_train, y_train)
    # normalize the coefficients so that the arrow has a fixed length
    coef = ridge.coef_ / linalg.norm(ridge.coef_)
    pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
               marker='^', cmap=pl.cm.Blues, s=100)
    pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
               marker='o', cmap=pl.cm.Blues, s=100)
    pl.arrow(0, 0, 7 * coef[0], 7 * coef[1], fc='gray', ec='gray',
             head_width=0.5, head_length=0.5)
    pl.text(2, 0, r'$\hat{w}$', fontsize=20)
    pl.axis('equal')
    pl.title('Estimation by Ridge regression')
    pl.show()
To assess the quality of our model we need to define a ranking score. Since we are interested in a model that orders the data, it is natural to look for a metric that compares the ordering of our model to the given ordering. For this, we use Kendall's tau correlation coefficient, which is defined as (P - Q)/(P + Q), where P is the number of concordant pairs and Q is the number of discordant pairs. This measure is used extensively in the ranking literature (e.g. "Optimizing Search Engines Using Clickthrough Data", Joachims 2002).
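The definition (P - Q)/(P + Q) can be spelled out directly by counting pairs. The function below is a plain-Python sketch with a hypothetical name and toy data. Pairs tied in either list count as neither concordant nor discordant. Note that `scipy.stats.kendalltau` computes the tau-b variant by default, which adjusts for ties, so its value can differ from this one when ties are present.

```python
import itertools

def kendall_tau(pred, target):
    """(P - Q) / (P + Q), with P concordant and Q discordant pairs."""
    concordant = discordant = 0
    for i, j in itertools.combinations(range(len(pred)), 2):
        sign = (pred[i] - pred[j]) * (target[i] - target[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0: tie in one of the lists, counted as neither
    return (concordant - discordant) / (concordant + discordant)

pred = [0.1, 0.4, 0.35, 0.8]
target = [0, 1, 2, 2]
tau = kendall_tau(pred, target)
print(tau)
```

In this toy example there are 4 concordant pairs, 1 discordant pair (the second and third items) and 1 tied pair (the last two targets are equal), giving (4 - 1)/(4 + 1) = 0.6. The function raises `ZeroDivisionError` when every pair is tied.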
We thus evaluate this metric on the test set for each block separately.
    from scipy import stats

    for i in range(2):
        tau, _ = stats.kendalltau(
            ridge.predict(X_test[b_test == i]), y_test[b_test == i])
        print('Kendall correlation coefficient for block %s: %.5f' % (i, tau))
Kendall correlation coefficient for block 0: 0.71122
Kendall correlation coefficient for block 1: 0.84387
The pairwise transform
As proved in (Herbrich 1999), if we consider linear ranking functions, the ranking problem can be transformed into a two-class classification problem. For this, we form the difference of all comparable elements such that our data is transformed into $(x'_k, y'_k) = (x_i - x_j, \text{sign}(y_i - y_j))$ for all comparable pairs.
In this way we have transformed our ranking problem into a two-class classification problem. The following plot shows the transformed dataset, where color reflects the difference in labels; our task is to separate positive samples from negative ones. The hyperplane $\{x^T w = 0\}$ separates these two classes.
    import itertools

    # form all pairwise combinations
    comb = itertools.combinations(range(X_train.shape[0]), 2)
    k = 0
    Xp, yp, diff = [], [], []
    for (i, j) in comb:
        if y_train[i] == y_train[j] or b_train[i] != b_train[j]:
            # skip if same target or different group
            continue
        Xp.append(X_train[i] - X_train[j])
        diff.append(y_train[i] - y_train[j])
        yp.append(np.sign(diff[-1]))
        # output balanced classes: force the labels to alternate +1, -1, ...
        if yp[-1] != (-1) ** k:
            yp[-1] *= -1
            Xp[-1] *= -1
            diff[-1] *= -1
        k += 1
    Xp, yp, diff = map(np.asanyarray, (Xp, yp, diff))
    pl.scatter(Xp[:, 0], Xp[:, 1], c=diff, s=60, marker='o', cmap=pl.cm.Blues)
    x_space = np.linspace(-10, 10)
    pl.plot(x_space * w[1], -x_space * w[0], color='gray')
    pl.text(3, -4, r'$\{x^T w = 0\}$', fontsize=17)
    pl.axis('equal')
    pl.show()
As we see in the previous plot, this classification problem is separable. This will not always be the case; however, in our training set there are no order inversions, so the corresponding classification problem is separable.
We will now finally train a Support Vector Machine model on the transformed data.
This model is known as RankSVM, although we note that the pairwise transform is more general and can be used together with any linear model. We will then plot the training data together with the coefficient $\hat{w}$ estimated by RankSVM.
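The training step described above could look like the following self-contained sketch. It is an illustration under assumptions, not the original notebook cell: it generates its own synthetic data similar to the dataset above, uses a hypothetical helper `pairwise_transform`, picks `svm.SVC(kernel='linear', C=0.1)` as the linear classifier, and evaluates on the training data only to keep the example short.

```python
import itertools
import numpy as np
from scipy import stats
from sklearn import svm

def pairwise_transform(X, y, groups):
    """Differences x_i - x_j for comparable pairs, with sign labels."""
    Xp, yp = [], []
    k = 0
    for i, j in itertools.combinations(range(X.shape[0]), 2):
        if y[i] == y[j] or groups[i] != groups[j]:
            continue  # same target or different group: not comparable
        diff_x, diff_y = X[i] - X[j], np.sign(y[i] - y[j])
        if diff_y != (-1) ** k:   # force labels to alternate +1, -1, ...
            diff_x, diff_y = -diff_x, -diff_y
        Xp.append(diff_x)
        yp.append(diff_y)
        k += 1
    return np.asarray(Xp), np.asarray(yp)

# synthetic data: three grades spread along w_true, two groups
rng = np.random.RandomState(0)
angle = np.deg2rad(60)
w_true = np.array([np.sin(angle), np.cos(angle)])
X = np.concatenate(
    [rng.randn(20, 2) + level * 4 * w_true for level in range(3)])
y = np.repeat([0, 1, 2], 20)
groups = np.tile([0, 1], 30)
X[groups == 0] -= np.array([3, 7])   # displace one of the groups

Xp, yp = pairwise_transform(X, y, groups)
clf = svm.SVC(kernel='linear', C=0.1)
clf.fit(Xp, yp)
coef = clf.coef_.ravel() / np.linalg.norm(clf.coef_)

taus = []
for g in range(2):
    mask = groups == g
    tau, _ = stats.kendalltau(X[mask] @ coef, y[mask])
    taus.append(tau)
    print('Kendall correlation coefficient for group %s: %.5f' % (g, tau))
```

Because the classifier is trained on differences between samples of the same group, the displacement between the two groups cancels out and does not affect the estimated direction `coef`.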
Finally, we check that, as expected, the ranking score (Kendall tau) increases with the RankSVM model with respect to linear regression.
    for i in range(2):
        tau, _ = stats.kendalltau(
            np.dot(X_test[b_test == i], coef), y_test[b_test == i])
        print('Kendall correlation coefficient for block %s: %.5f' % (i, tau))
Kendall correlation coefficient for block 0: 0.83627
Kendall correlation coefficient for block 1: 0.84387
For block 0 this is indeed higher than the value obtained with linear regression (0.71122), while the score for block 1 is unchanged (0.84387).
The original IPython notebook for this blog post is also available.
"Large Margin Rank Boundaries for Ordinal Regression", R. Herbrich, T. Graepel, and K. Obermayer. Advances in Large Margin Classifiers, 115-132, MIT Press, 2000.
"Optimizing Search Engines Using Clickthrough Data", T. Joachims. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
"Learning to rank from medical imaging data", Pedregosa et al. (arXiv).
"Efficient algorithms for ranking with SVMs", O. Chapelle and S. S. Keerthi, Information Retrieval Journal, Special Issue on Learning to Rank, 2009.