Numerical optimizers for Logistic Regression

Category: misc
#machine learning #logistic regression #Python #SciPy

In this post I compare several implementations of Logistic Regression. The task was to implement a Logistic Regression model using standard optimization tools from scipy.optimize and compare them against state-of-the-art implementations such as LIBLINEAR.

In this blog post I'll write down all the implementation details of this model, in the hope that not only the conclusions but also the process will be useful for future comparisons and benchmarks.

Function evaluation

We consider the case in which the decision function is an affine function, i.e., $f(x) = \langle x, w \rangle + c$, where $w$ and $c$ are parameters to estimate. The loss function for $\ell_2$-regularized logistic regression, i.e., the function to be minimized, is

$$ \mathcal{L}(w, c, \lambda, X, y) = - \frac{1}{n}\sum_{i=1}^n \log(\phi(y_i (\langle X_i, w \rangle + c))) + \frac{\lambda}{2} w^T w $$

where $\phi(t) = 1 / (1 + \exp(-t))$ is the logistic function, $\frac{\lambda}{2} w^T w$ is the regularization term and $X, y$ is the input data, with $X \in \mathbb{R}^{n \times p}$ and $y \in \{-1, 1\}^n$. However, this formulation is not great from a practical standpoint. Even for not unlikely values of $t$ such as $t = -100$, $\exp(100)$ will overflow, assigning the loss an (erroneous) value of $+\infty$. For this reason [1], we evaluate $\log(\phi(t))$ as

$$ \log(\phi(t)) = \begin{cases} - \log(1 + \exp(-t)) \text{ if } t > 0 \\ t - \log(1 + \exp(t)) \text{ if } t \leq 0\\ \end{cases} $$

The gradient of the loss function is given by

$$\begin{aligned} \nabla_w \mathcal{L} &= \frac{1}{n}\sum_{i=1}^n y_i X_i (\phi(y_i (\langle X_i, w \rangle + c)) - 1) + \lambda w \\ \nabla_c \mathcal{L} &= \frac{1}{n}\sum_{i=1}^n y_i (\phi(y_i (\langle X_i, w \rangle + c)) - 1) \end{aligned}$$

Similarly, the logistic function $\phi$ used here can be computed in a more stable way using the formula

$$ \phi(t) = \begin{cases} 1 / (1 + \exp(-t)) \text{ if } t > 0 \\ \exp(t) / (1 + \exp(t)) \text{ if } t \leq 0\\ \end{cases} $$

Finally, we will also need the Hessian for some second-order methods, which is given by

$$ \nabla_w ^2 \mathcal{L} = \frac{1}{n} X^T D X + \lambda I $$

where $I$ is the identity matrix and $D$ is a diagonal matrix with entries $D_{ii} = \phi(y_i (\langle X_i, w \rangle + c))(1 - \phi(y_i (\langle X_i, w \rangle + c)))$.

In Python, these functions can be written as:

import numpy as np

def phi(t):
    # logistic function, returns 1 / (1 + exp(-t))
    idx = t > 0
    out = np.empty(t.size, dtype=np.float64)
    out[idx] = 1. / (1 + np.exp(-t[idx]))
    exp_t = np.exp(t[~idx])
    out[~idx] = exp_t / (1. + exp_t)
    return out

def loss(x0, X, y, alpha):
    # logistic loss: mean of -log(phi(y_i * (x_i . w + c))) plus the l2 penalty
    w, c = x0[:X.shape[1]], x0[-1]
    z = X.dot(w) + c
    yz = y * z
    idx = yz > 0
    out = np.zeros_like(yz)
    out[idx] = np.log(1 + np.exp(-yz[idx]))
    out[~idx] = (-yz[~idx] + np.log(1 + np.exp(yz[~idx])))
    out = out.sum() / X.shape[0] + .5 * alpha * w.dot(w)
    return out

def gradient(x0, X, y, alpha):
    # gradient of the logistic loss
    w, c = x0[:X.shape[1]], x0[-1]
    z = X.dot(w) + c
    z = phi(y * z)
    z0 = (z - 1) * y
    grad_w = X.T.dot(z0) / X.shape[0] + alpha * w
    grad_c = z0.sum() / X.shape[0]
    return np.concatenate((grad_w, [grad_c]))
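
The Hessian from the formula above is only needed by the second-order solvers; for completeness, here is a sketch of it with respect to $w$ only (the intercept is ignored). The helper name hessian is mine and not part of the benchmark code:

def hessian(x0, X, y, alpha):
    # Hessian of the logistic loss with respect to w (intercept ignored):
    # (1 / n) X^T D X + alpha * I, with D_ii = phi(y_i z_i) (1 - phi(y_i z_i))
    w, c = x0[:X.shape[1]], x0[-1]
    z = phi(y * (X.dot(w) + c))
    D = z * (1 - z)
    return X.T.dot(D[:, np.newaxis] * X) / X.shape[0] + alpha * np.eye(X.shape[1])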


I tried several methods to estimate this $\ell_2$-regularized logistic regression model: Conjugate Gradient, BFGS, L-BFGS, Truncated Newton (TNC) and Trust Region Newton (the solver used by LIBLINEAR). Conjugate Gradient is a first-order method (it only makes use of the gradient and not of the Hessian), BFGS and L-BFGS are quasi-Newton methods, while the remaining two use second-order information.

To ensure the most accurate results across implementations, all timings were collected by callback functions called from the algorithm on each iteration. Finally, I plot the maximum absolute value of the gradient (i.e., the infinity norm of the gradient) with respect to time.
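
As a rough sketch of how these solvers can be driven (the exact options are in the notebook linked below; this assumes X, y and alpha are the design matrix, labels and regularization parameter described in the next paragraph):

from scipy import optimize

# initial point: coefficients w followed by the intercept c
x0 = np.zeros(X.shape[1] + 1)

# L-BFGS
w_lbfgs, f_min, info = optimize.fmin_l_bfgs_b(
    loss, x0, fprime=gradient, args=(X, y, alpha))

# Conjugate Gradient, recording the infinity norm of the gradient at each iteration
grad_trace = []

def record_grad(xk):
    # store the infinity norm of the gradient at the current iterate
    grad_trace.append(np.max(np.abs(gradient(xk, X, y, alpha))))

w_cg = optimize.fmin_cg(loss, x0, fprime=gradient, args=(X, y, alpha),
                        callback=record_grad)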

The synthetic data used in the benchmarks was generated as described in [2]: the design matrix $X$ is Gaussian noise, the vector of coefficients $w$ is also drawn from a Gaussian distribution, and the explained variable $y$ is generated as $y = \text{sign}(X w)$. We then perturb the matrix $X$ by adding Gaussian noise with covariance 0.8. The number of samples and features was fixed to $10^4$ and $10^3$, respectively. The penalization parameter $\lambda$ was fixed to 1.
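
A sketch of this data generation (the random seed and the exact noise scaling are my own choices, not taken from the benchmark code):

n_samples, n_features = 10 ** 4, 10 ** 3
rng = np.random.RandomState(0)

X = rng.randn(n_samples, n_features)                       # Gaussian design
w_true = rng.randn(n_features)                             # Gaussian coefficients
y = np.sign(X.dot(w_true))                                  # labels in {-1, 1}
X += np.sqrt(0.8) * rng.randn(n_samples, n_features)        # Gaussian noise, variance 0.8
alpha = 1.                                                   # penalization parameter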

In this setting variables are typically uncorrelated and most solvers perform decently:

Benchmark Logistic

Here, the Trust Region and L-BFGS solvers perform almost equally well, with Conjugate Gradient and Truncated Newton falling shortly behind. I was surprised by the difference between BFGS and L-BFGS; I would have thought that when memory is not an issue both algorithms should perform similarly.

To make things more interesting, we now make the design slightly more correlated. We do so by adding a constant term of 1 to the matrix $X$ and appending a column vector of ones to this matrix to account for the intercept. These are the results:

Benchmark Logistic

Here, we already see that second-order methods dominate over first-order methods (well, except for BFGS), with Trust Region clearly dominating the picture but with TNC not far behind.

Finally, if we force the matrix to be even more correlated (we add 10. to the design matrix $X$), then we have:

Benchmark Logistic

Here, the Trust Region method has the same timing as before, but all other methods have become substantially worse. The Trust Region method, unlike the other methods, is surprisingly robust to correlated designs.

To sum up, the Trust Region method performs extremely well for optimizing the Logistic Regression model under different conditionings of the design matrix. The LIBLINEAR software uses this solver and thus has similar performance, with the difference that its evaluation of the logistic function and its derivatives is done in C++ instead of Python. In practice, however, due to the small number of iterations of this solver, I haven't seen any significant difference.

  1. A similar development can be found in the source code of LIBLINEAR, and is probably also used elsewhere. 

  2. "A comparison of numerical optimizers for logistic regression", T. P. Minka, URL 

  3. "Newton's Method for Large Bound-Constrained Optimization Problems", Chih-Jen Lin, Jorge J. Moré, URL 

  4. IPython Notebook to reproduce the benchmarks source 

Singular Value Decomposition in SciPy

Category: misc
#python #scipy #svd

SciPy contains two methods to compute the singular value decomposition (SVD) of a matrix: scipy.linalg.svd and scipy.sparse.linalg.svds. In this post I'll compare both methods for the task of computing the full SVD of a large dense matrix.

The first method, scipy.linalg.svd, is perhaps the best known and uses the linear algebra library LAPACK to handle the computations. It implements the Golub-Kahan-Reinsch algorithm [1], which is accurate and highly efficient with a cost of O(n^3) floating-point operations [2].

The second method is scipy.sparse.linalg.svds and, despite its name, it also works fine for dense arrays. This implementation is based on the ARPACK library and consists of an iterative procedure that finds the SVD by reducing the problem to an eigendecomposition of X.T * X, applied implicitly as a linear operator. This method is usually very effective when the input matrix X is sparse or when only the largest singular values are required. There are other SVD solvers that I did not consider, such as sparsesvd or pysparse.jdsym, but my points about the sparse solver probably hold for those packages too since they both implement iterative algorithms based on the same principles.
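
For reference, this is a minimal sketch of how both routines are called on a dense array (note that svds only computes k singular triplets, with k strictly smaller than the smallest matrix dimension):

import numpy as np
from scipy import linalg
from scipy.sparse import linalg as splinalg

X = np.random.randn(1000, 500)

# LAPACK-based full SVD
U, s, Vt = linalg.svd(X, full_matrices=False)

# ARPACK-based iterative solver: the k largest singular values/vectors
Uk, sk, Vtk = splinalg.svds(X, k=10)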

When the input matrix is dense and all the singular values are required, the first method is usually more efficient. To support this statement I've created a little benchmark: timings for both methods as a function of the size of the matrices. Notice that this setting is clearly favorable to linalg.svd: after all, sparse.linalg.svds was not created with it in mind, but for sparse matrices or dense matrices with some special structure. We will see, however, that even in this setting it has interesting advantages.

I'll create random square matrices with different sizes and plot the timings for both methods. For the benchmarks I used SciPy v0.12 linked against Intel Math Kernel Library v11. Both methods are single-threaded (I had to set OMP_NUM_THREADS=1 so that MKL does not try to parallelize the computations). [code]

svd timings

Lower timings are better, so this makes scipy.linalg.svd the clear winner. However, this is just part of the story. What this graph doesn't show is that this method wins at the price of allocating a huge amount of memory for temporary computations. If we now plot the memory consumption for both methods under the same settings, the story is completely different. [code]

svd memory

The memory requirements of scipy.linalg.svd scale with the number of dimensions, while for the sparse version the amount of allocated memory is essentially constant. Notice that we are measuring the total memory used, so it is natural to see a slight increase in memory consumption since the input matrix is bigger on each iteration.
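
The actual benchmark code is linked above; as a hypothetical sketch, peak memory for each call can be measured with, e.g., the memory_profiler package (matrix size and number of singular values here are arbitrary):

import numpy as np
from memory_profiler import memory_usage
from scipy import linalg
from scipy.sparse import linalg as splinalg

X = np.random.randn(2000, 2000)

# peak memory (in MiB) observed while each solver runs
peak_dense = max(memory_usage((linalg.svd, (X,), {'full_matrices': False})))
peak_sparse = max(memory_usage((splinalg.svds, (X,), {'k': 100})))
print(peak_dense, peak_sparse)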

For example, in my applications I need to compute the SVD of a matrix for which the needed workspace does not fit in memory. In cases like this, the sparse algorithm (sparse.linalg.svds) can come in handy: the timing is just a constant factor worse (and I can easily parallelize jobs), while the memory requirements of this method are peanuts compared to the dense version.

  1. "Calculating the singular values and pseudo-inverse of a matrix", Gene H. Golub and William Kahan, 1965, JSTOR 

  2. "A Survey of Singular Value Decomposition Methods and Performance Comparison of Some Available Serial Codes", Gerald E. Plassman, 2005, PDF 

qr_multiply function in scipy.linalg

Category: misc
#python #scipy

In scipy's development version there's a new function closely related to the QR-decomposition of a matrix and to the least-squares solution of a linear system. What this function does is to compute the QR-decomposition of a matrix and then multiply the resulting orthogonal factor by another arbitrary matrix. In pseudocode:

def qr_multiply(X, Y):
    Q, R = qr(X)
    return dot(Q.T, Y)

but unlike this naive implementation, qr_multiply is able to do all this without explicitly computing the orthogonal Q matrix, resulting in both memory and time savings. In the following picture I measured the memory consumption as a function of time for this computation on a 1,000 x 1,000 matrix X and a vector Y (full code can be found here):

It can be seen that not only is qr_multiply almost twice as fast as the naive approach, but also that the memory consumption is significantly reduced, since the orthogonal factor is never explicitly computed. Credit for implementing the qr_multiply function goes to Martin Teichmann.
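
A minimal usage sketch (the mode semantics are taken from the scipy documentation: with mode='right' the product dot(c, Q) is returned together with R, which for a 1-D y equals dot(Q.T, y)):

import numpy as np
from scipy import linalg

X = np.random.randn(1000, 1000)
y = np.random.randn(1000)

# naive approach: form Q explicitly, then multiply
Q, R = linalg.qr(X, mode='economic')
qty_naive = np.dot(Q.T, y)

# qr_multiply: same result, without ever forming Q explicitly
qty, R2 = linalg.qr_multiply(X, y, mode='right')

print(np.allclose(qty, qty_naive))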

Ridge regression path

Category: misc
#scikit-learn #scipy #linear algebra

Ridge coefficients for multiple values of the regularization parameter can be elegantly computed by updating the thin SVD decomposition of the design matrix:

import numpy as np
from scipy import linalg

def ridge(A, b, alphas):
    """Return coefficients for regularized least squares

         min ||A x - b||^2 + alpha ||x||^2

    Parameters
    ----------
    A : array, shape (n, p)
    b : array, shape (n,)
    alphas : array, shape (k,)

    Returns
    -------
    coef : array, shape (p, k)
    """
    U, s, Vt = linalg.svd(A, full_matrices=False)
    d = s / (s[np.newaxis, :] ** 2 + alphas[:, np.newaxis])
    return np.dot(d * U.T.dot(b), Vt).T

This can be used to efficiently compute what is called the regularization path, that is, the coefficients as a function of the regularization parameter. Since the bottleneck of the algorithm is the singular value decomposition, computing the coefficients for other values of the regularization parameter basically comes for free.
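
For instance, a hypothetical usage sketch on random data, plotting the full path:

import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
A = rng.randn(50, 10)
b = rng.randn(50)
alphas = np.logspace(-3, 3, 100)

coefs = ridge(A, b, alphas)        # shape (p, k): one column per alpha
plt.plot(alphas, coefs.T)
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('coefficients')
plt.show()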

A variant of this algorithm can then be used to compute the optimal regularization parameter in the sense of leave-one-out cross-validation; it is implemented in scikit-learn's RidgeCV (about which Mathieu Blondel has an excellent post). This optimal parameter is denoted by a vertical dotted line in the following picture; full code can be found here.

Computing the vector norm

Category: misc
#linear algebra #norm #scipy

Update: a fast and stable norm was added to scipy.linalg in August 2011 and will be available in scipy 0.10.

Last week I discussed with Gael how we should compute the Euclidean norm of a vector a using SciPy. Two approaches suggest themselves: calling scipy.linalg.norm(a) or computing sqrt(a.T a). As I learned later, both have issues.

Note: I use single-precision arithmetic for simplicity, but similar results hold for double-precision.

Overflow and underflow

Both approaches behave terribly in the presence of very large or very small numbers. Take for example an array with a single entry:

In [0]: a = np.array([1e20], dtype=np.float32)
In [1]: a
Out[1]: array([1.00000002e+20], dtype=float32)
In [2]: scipy.linalg.norm(a)
Out[2]: inf
In [3]: np.sqrt(np.dot(a.T, a))
Out[3]: inf

That is, both methods return infinity. However, the correct answer is 10^20, which fits comfortably in a single-precision float. Similar examples can be found where numbers underflow.


Again, scipy.linalg.norm behaves badly as far as numerical stability is concerned. In the presence of entries of very different magnitudes, severe cancellation can occur. Take for example an array whose first entry is 10,000, followed by 10,000 ones:

a = np.array([1e4] + [1]*10000, dtype=np.float32)

In this case, scipy.linalg.norm will discard all the ones, producing

In [3]: linalg.norm(a) - 1e4
Out[3]: 0.0

when the correct answer is 0.5. In this case $\sqrt{a^T a}$ has a much nicer behavior since results of a dot-product in single precision are accumulated using double-precision (but if double-precision is used, results won't be accumulated using quadruple-precision):

In [4]: np.sqrt(np.dot(a.T, a)) - 1e4
Out[4]: 0.5


The BLAS function nrm2 does automatic scaling of parameters rendering it more stable and tolerant to overflow. Luckily, scipy provides a mechanism to call some BLAS functions:

In [5]: nrm2, = scipy.linalg.get_blas_funcs(('nrm2',), (a,))

Using this function, no overflow occurs (hurray!)

In [95]: a = np.array([1e20], dtype=np.float32)
In [96]: nrm2(a)
Out[96]: 1.0000000200408773e+20

and, going back to the cancellation example, stability is greatly improved

In [98]: a = np.array([1e4] + [1]*10000, dtype=np.float32)
In [99]: nrm2(a) - 1e4
Out[99]: 0.49998750062513864

Update: as of scipy 0.10, this function is used by scipy.linalg.norm.


Computing the 2-norm of an array is a very cheap operation, so the computation is usually dominated by external factors such as memory access latency or overhead in the Python/C layer. Experimental benchmarks on an array of size 10^7 show that nrm2 is marginally slower than $\sqrt{a^T a}$, because the scaling has a cost, but it is also more stable and less prone to overflow and underflow. They also show that scipy.linalg.norm is the slowest (and numerically worst!) of all. A sketch of such a benchmark follows the table below.

Method                    Time
$\sqrt{a^T a}$            0.02
BLAS nrm2(a)              0.02
scipy.linalg.norm(a)      0.16
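
A sketch of how such a benchmark can be run from IPython (array size and dtype as above; the absolute numbers will of course depend on hardware and on the BLAS in use):

In [0]: import numpy as np

In [1]: from scipy import linalg

In [2]: a = np.random.randn(10 ** 7).astype(np.float32)

In [3]: nrm2, = linalg.get_blas_funcs(('nrm2',), (a,))

In [4]: %timeit np.sqrt(np.dot(a, a))

In [5]: %timeit nrm2(a)

In [6]: %timeit linalg.norm(a)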