
How to Evaluate the Logistic Loss and not NaN trying

Category: optimization
#logistic regression #Python #NumPy

By Fabian Pedregosa and Bart van Merriënboer (Google Research).

A naive implementation of the logistic regression loss can result in numerical indeterminacy even for moderate values of the input. In this post I take a closer look at the source of these instabilities and discuss more robust Python implementations.

$$ \def\aa{\boldsymbol a} \def\bb{\boldsymbol b} \def\dd{\boldsymbol d} \def\cc{\boldsymbol c} \def\xx{\boldsymbol x} \def\zz{\boldsymbol z} \def\uu{\boldsymbol u} \def\vv{\boldsymbol v} \def\yy{\boldsymbol y} \def\ss{\boldsymbol s} \def\pp{\boldsymbol p} \def\RR{\mathbb{R}} \def\TT{\boldsymbol T} \def\CC{\boldsymbol C} \def\Econd{\boldsymbol E} \DeclareMathOperator*{\argmin}{{arg\,min}} \DeclareMathOperator*{\argmax}{{arg\,max}} \DeclareMathOperator*{\minimize}{{minimize}} \DeclareMathOperator*{\dom}{\mathbf{dom}} \DeclareMathOperator*{\Fix}{\mathbf{Fix}} \DeclareMathOperator{\prox}{\mathbf{prox}} \def\defas{\stackrel{\text{def}}{=}} $$

Logistic regression

I'll consider the logistic regression loss, defined as \begin{equation}\label{eq:logloss} f(\xx) \defas \frac{1}{n}\sum_{i=1}^n - b_i \log(s(\aa_i^T \xx)) - (1 - b_i) \log(1 - s(\aa_i^T \xx))~, \end{equation} where each $\aa_i$ is a $d$-dimensional vector, $b_i$ is a scalar between 0 and 1, and $s(t) \defas 1 / (1 + \exp(-t))$ is the sigmoid function. Some references use the slightly different formulation \begin{equation} \label{eq:logloss2} \frac{1}{n}\sum_{i=1}^n \log(1 + \exp(- b_i \aa_i^T \xx))~, \end{equation} where $b_i$ only takes the values $-1$ or $1$ instead of being in the range $[0, 1]$; this is the formulation that appears for example in the scikit-learn documentation. This function has many interpretations that I will not go into here, as my main focus will be the numerical issues that arise when evaluating this expression.

Below is a straightforward implementation of the previous equation, where $A$ is the row-wise stacking of $\aa_i$:

import numpy as np


def f_naive(x, A, b):
    """Logistic loss, naive implementation."""
    z = np.dot(A, x)
    b = np.asarray(b)
    tmp = 1. / (1 + np.exp(-z))
    return np.mean(- b * np.log(tmp) - (1 - b) * np.log(1 - tmp))

It turns out that this simple and innocent-looking implementation breaks down for moderately large values of the input. This can be a real problem, for example when training a machine learning model that uses this loss function. Luckily, most machine learning libraries provide more sophisticated implementations that safeguard against this indeterminacy. The first time I became aware of these tricks was through the liblinear package around 2010, although the trick was most likely known and used even before; in that library, the implementation of the logistic loss contains a distinction of cases to safeguard against the indeterminacy. For example, if we try the above code with inputs (say) $A = [[1, 1]]$, $b=[1]$ and $x=[20, 20]$, we get the dreadful Not-A-Number (NaN):

>>> A = [[1, 1]]
>>> b = [1]
>>> f_naive([20, 20], A, b)
RuntimeWarning: divide by zero encountered in log
RuntimeWarning: invalid value encountered in multiply
nan

The first issue is that the two terms in 1 + np.exp(-z) are on very different scales: exp(-40) is of the order of $10^{-18}$. Since double precision can hold at most about 17 significant decimal digits, the expression 1 + np.exp(-40) gets rounded to 1, a phenomenon known as round-off error in numerical analysis. As a consequence tmp evaluates to exactly 1, so 1 - tmp is exactly 0, np.log(1 - tmp) is -inf, and the product (1 - b) * (-inf) = 0 * (-inf) yields the NaN seen above.

Second, the exponential function will overflow for large values of the input. For double precision types (float64 in NumPy jargon), $e^t$ overflows for values $t \geq 710$, while for single precision types (float32) this happens for $t \geq 89$.
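
Both effects are easy to reproduce in isolation. The snippet below is a minimal illustration (the printed values are approximate and the thresholds depend on the floating-point type):

import numpy as np

# Round-off: exp(-40) ~ 4e-18 is lost when added to 1 in double precision,
# so the sum compares equal to 1 and its logarithm is exactly 0.
print(1.0 + np.exp(-40) == 1.0)   # True
print(np.log(1.0 + np.exp(-40)))  # 0.0 instead of ~4.25e-18

# Overflow: exp(t) exceeds the largest representable float64 around t = 710
# (and the largest float32 around t = 89), producing inf.
print(np.exp(709.0))             # ~8.2e307, still finite
print(np.exp(710.0))             # inf, with an overflow RuntimeWarning
print(np.exp(np.float32(89.0)))  # inf in single precision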

A closer look at the log-sigmoid

Log-sigmoid function in the interval [-5, 5].

The issues encountered before happen during the evaluation of the log-sigmoid function $\log(s(\cdot))$. We therefore take a closer look at this function and examine its accuracy. We will compare the following 3 different implementations:

1. naive. Directly evaluating $\log(s(t))$ as in the previous implementation. In code, this corresponds to

import numpy as np

def logsig_naive(t):
    return np.log(1 / (1 + np.exp(-t)))

2. logsumexp. Scipy's logsumexp provides a stable computation of the expression \begin{equation} \varphi(\bb) = \log\left(\sum^{d}_{i=1} \exp(b_i)\right)~. \end{equation} We can reuse this function to compute the log-sigmoid through the identity \begin{align} \log(s(t)) &= \log(1/(1 + \exp(-t)))\\ &= -\log(1 + \exp(-t))\\ &= - \varphi([0, -t])~. \end{align} This suggests the following implementation:

import numpy as np
from scipy import special

def logsig_logsumexp(t):
    return - special.logsumexp([0, -t])

3. log1pexp. Martin Mächler wrote a note in 2012 on how to accurately compute the log-sigmoid function in the R language. His technique is based on using different approximations for different values of the input and relies on the np.log1p function, which computes $\log(1 + x)$ accurately for small values of $x$. This implementation of the log-sigmoid function computes \begin{equation} \log(s(t)) = \begin{cases} t & t < -33.3\\ t - \exp(t) & -33.3 \leq t \leq -18 \\ -\text{log1p}(\exp(-t)) & -18 \leq t \leq 37 \\ -\exp(-t) & 37 \leq t\\ \end{cases}~. \end{equation} A Python implementation of this could be

import numpy as np

def logsig_log1pexp(t):
    if t < -33.3:
        return t
    elif t <= -18:
        return t - np.exp(t)
    elif t <= 37:
        return -np.log1p(np.exp(-t))
    else:
        return -np.exp(-t)
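
Before comparing these routines systematically, a quick spot check already shows the differences. This assumes the three functions above are defined in the current session; the values in the comments are approximate:

# Positive regime: only the log1pexp variant retains the tiny negative value.
print(logsig_naive(40.0))      # 0.0 (all precision lost)
print(logsig_logsumexp(40.0))  # -0.0 (all precision lost)
print(logsig_log1pexp(40.0))   # approx -4.25e-18 (accurate)

# Very negative inputs: the naive version overflows inside exp and returns -inf.
print(logsig_naive(-800.0))      # -inf, with RuntimeWarnings
print(logsig_logsumexp(-800.0))  # -800.0
print(logsig_log1pexp(-800.0))   # -800.0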

We can now compare the relative accuracy of these approaches. In the plot below, we show the relative accuracy (lower is better) as a function of the input.

Relative precision for the log-sigmoid computation (lower is better). While most methods do reasonably well for negative values, the precision quickly deteriorates in the positive regime. Values above 1 indicate that the routine gave NaN.

A more stable implementation of the logistic loss

With the results from the previous section we can now write a more stable version of the logistic loss.

One last trick that we will use is the formula $\log(1 - s(z)) = -z + \log(s(z))$, which follows from $1 - s(z) = e^{-z} s(z)$, to simplify \eqref{eq:logloss} slightly so that it becomes \begin{equation}\label{eq:logistic2} f(\xx) = \frac{1}{n}\sum_{i=1}^n (1 - b_i) \aa_i^T \xx - \log(s(\aa_i^T \xx))~. \end{equation}
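
As a quick check of this rewriting, both expressions agree numerically in a regime where the naive form is still well behaved (a small sketch with arbitrary toy values):

import numpy as np

A = np.array([[1.0, 2.0], [0.5, -1.0]])
b = np.array([0.0, 1.0])
x = np.array([0.3, -0.2])

z = A.dot(x)
s = 1 / (1 + np.exp(-z))
original = np.mean(-b * np.log(s) - (1 - b) * np.log(1 - s))
rewritten = np.mean((1 - b) * z - np.log(s))
print(np.allclose(original, rewritten))  # True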

And finally, here is the full Python implementation. The routine logsig is slightly more convoluted than the ones presented before because it applies the previous case distinction component-wise, without for loops, for efficiency on large arrays:

import numpy as np
from scipy import special


def logsig(x):
    """Compute the log-sigmoid function component-wise."""
    out = np.zeros_like(x)
    idx0 = x < -33
    out[idx0] = x[idx0]
    idx1 = (x >= -33) & (x < -18)
    out[idx1] = x[idx1] - np.exp(x[idx1])
    idx2 = (x >= -18) & (x < 37)
    out[idx2] = -np.log1p(np.exp(-x[idx2]))
    idx3 = x >= 37
    out[idx3] = -np.exp(-x[idx3])
    return out


def f(x, A, b):
    """Logistic loss, numerically stable implementation.
    
    Parameters
    ----------
    A: array-like, shape (n_samples, n_features)
        Data matrix

    b: array-like, shape (n_samples,)
        Labels

    x: array-like, shape (n_features,)
        Coefficients

    Returns
    -------
    loss: float
    """
    z = np.dot(A, x)
    b = np.asarray(b)
    return np.mean((1 - b) * z - logsig(z))

On the same example as before, this implementation no longer produces NaN:

>>> A = [[1, 1]]
>>> b = [1]
>>> f([20, 20], A, b)
0.0
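
As a sanity check, the stable implementation should agree with the naive one whenever the latter does not break down. A minimal comparison on small random inputs (sizes, seed and tolerance are arbitrary; f_naive and f are the functions defined above):

import numpy as np

rng = np.random.RandomState(0)
A_small = rng.randn(10, 4)
b_small = rng.uniform(0, 1, size=10)
x_small = rng.randn(4)

# In this well-conditioned regime both implementations should agree
# to within floating-point accuracy.
print(np.allclose(f_naive(x_small, A_small, b_small),
                  f(x_small, A_small, b_small)))  # True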

Gradient

For optimization, having an accurate evaluation of the gradient is crucial to achieve a solution with high accuracy. In this section I'll compare different approaches to compute this gradient.

Computation of the gradient of the logistic loss simplifies considerably thanks to the following identity verified by the sigmoid function: \begin{equation}\label{eq:diffeq_sigma} \frac{d}{dt}s(t) = s(t)(1 - s(t))~. \end{equation} In particular, this implies that \begin{equation} \label{eq:dif_log} \frac{d}{dt}\log(s(t)) = \frac{\frac{d}{dt}s(t)}{s(t)} = 1 - s(t) \end{equation}

Using these identities and \eqref{eq:logistic2}, we can write the gradient of the logistic loss as \begin{equation} \label{eq:gradient} \nabla f(\xx) =~ \frac{1}{n}\sum_{i=1}^n\aa_i (s(\aa_i^T \xx) - b_i) \end{equation}

As we will see, there are also some numerical issues in the computation of $s(\cdot) - b_i$ that can be problematic when evaluating this expression. We will now compare three different implementations of $g(t) = s(t) - b$:

1. naive. The previous equation is easy to implement using SciPy's sigmoid function (special.expit):

from scipy import special

def expitb_naive(t, b):
    return special.expit(t) - b
  

Unfortunately, the subtraction of $b$ can cause a catastrophic loss of precision when $s(t)$ and $b$ are both close to 1:

>>> expitb_naive(40, 1)
0.0
  

while the correct answer is $\approx -4.24 \times 10^{-18}$.

2. expitb_v2. We can avoid the precision loss of the previous example by reformulating the problematic expression as a single fraction: \begin{align} s(t) - b = \frac{1}{1 + e^{-t}} - b &= \frac{1 - b - b e^{-t}}{1 + e^{-t}}\\ &= \frac{(1 - b) e^t - b}{e^t + 1}~.\label{eq:expit_b} \end{align} In this case, when $b=1$ the expression becomes $-1 / (e^t + 1)$, which doesn't suffer from the same issue.

import numpy as np

def expitb_v2(t, b):
    exp_t = np.exp(t)
    return ((1 - b) * exp_t - b) / (1 + exp_t)
  

This, as advertised, gives an accurate answer in the previous use case:

>>> expitb_v2(40, 1)
-4.248354255291589e-18
  

While better than the naive approach, this routine suffers from its own issues: the exponential overflows for large values of $t$ (around $t = 710$ in double precision, as discussed before), and the routine then returns NaN:

>>> expitb_v2(800, 1)
RuntimeWarning: overflow encountered in exp
RuntimeWarning: invalid value encountered in double_scalars
nan
    

3. expitb_sign. We can fix this last problem by never evaluating the exponential at a large positive input. For this, we multiply the numerator and denominator of \eqref{eq:expit_b} by $e^{-t}$ to arrive at the equivalent expression $((1 - b) - b e^{-t}) / (1 + e^{-t})$, which does not contain any terms in $e^t$. Finally, we choose this formula whenever $t \geq 0$ and the previous one otherwise. This gives the following code:

import numpy as np

def expitb_sign(t, b):
    if t < 0:
        exp_t = np.exp(t)
        return ((1 - b) * exp_t - b) / (1 + exp_t)
    else:
        exp_nt = np.exp(-t)
        return ((1 - b) - b * exp_nt) / (1 + exp_nt)
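
A quick spot check, using the routines defined above, suggests this variant behaves well in both regimes. The reference value quoted earlier can be reproduced in extended precision, for instance with the mpmath library (assuming it is installed):

import mpmath

# High-precision reference for sigmoid(40) - 1 = -1 / (1 + exp(40)).
mpmath.mp.dps = 50
print(-1 / (1 + mpmath.exp(40)))  # approximately -4.2483542552915889e-18

print(expitb_sign(40.0, 1))   # approx -4.25e-18, close to the reference
print(expitb_sign(800.0, 1))  # 0.0, no overflow
print(expitb_v2(800.0, 1))    # nan, after an overflow RuntimeWarning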

We can now compare the relative accuracy of these approaches. In the plot below, we show the relative accuracy (lower is better) as a function of the input.

Relative precision for different implementations of the logistic loss's gradient (lower is better). The naive method quickly suffers from a loss of precision in the positive segment. expitb_v2 exhibits better accuracy but outputs NaN for large values of the input (values above 1 indicate NaN). expitb_sign has none of these issues and has the best overall accuracy.

Implementation of the gradient

We can now use this to compute the full gradient. A full implementation of the logistic loss and its gradient that also accepts sparse matrices can be found in the optimization package copt.

import numpy as np

def expit_b(x, b):
    """Compute sigmoid(x) - b component-wise."""
    b = np.asarray(b)
    idx = x < 0
    out = np.zeros_like(x)
    exp_x = np.exp(x[idx])
    b_idx = b[idx]
    out[idx] = ((1 - b_idx) * exp_x - b_idx) / (1 + exp_x)
    exp_nx = np.exp(-x[~idx])
    b_nidx = b[~idx]
    out[~idx] = ((1 - b_nidx) - b_nidx * exp_nx) / (1 + exp_nx)
    return out


def f_grad(x, A, b):
    """Computes the gradient of the logistic loss.
    
    Parameters
    ----------
    A: array-like, shape (n_samples, n_features)
        Data matrix

    b: array-like, shape (n_samples,)
        Labels

    x: array-like, shape (n_features,)
        Coefficients

    Returns
    -------
    grad: array-like, shape (n_features,)    
    """
    z = A.dot(x)
    s = expit_b(z, b)
    return A.T.dot(s) / A.shape[0] 
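
As a final sanity check, the analytical gradient can be compared against a finite-difference approximation, for example with scipy.optimize.check_grad (a small sketch on random data; sizes and seed are arbitrary, and f, f_grad are the functions defined above):

import numpy as np
from scipy import optimize

rng = np.random.RandomState(0)
A_check = rng.randn(20, 5)
b_check = rng.uniform(0, 1, size=20)
x_check = rng.randn(5)

# check_grad returns the norm of the difference between the analytical
# gradient and a finite-difference approximation; it should be small.
err = optimize.check_grad(f, f_grad, x_check, A_check, b_check)
print(err)  # typically around 1e-7 or smaller for this smooth loss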

References