How to Evaluate the Logistic Loss and not NaN trying
Category: optimization
#logistic regression #Python #NumPy
A naive implementation of the logistic regression loss can result in numerical indeterminacy even for moderate values of the input. In this post I take a closer look at the source of these instabilities and discuss more robust Python implementations.
Logistic regression
I'll consider the logistic regression loss, defined as
\begin{equation}\label{eq:logloss}
f(\xx) \defas \frac{1}{n}\sum_{i=1}^n - b_i \log(s(\aa_i^T \xx)) - (1 - b_i) \log(1 - s(\aa_i^T \xx))~,
\end{equation}
where each $\aa_i$ is a $d$-dimensional vector, $b_i$ is a scalar between 0 and 1,
and $s(t) \defas 1 / (1 + \exp(-t))$ is the sigmoid function.
Below is a straightforward implementation of the previous equation, where $A$ is the row-wise stacking of $\aa_i$:
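(A minimal sketch of such an implementation; the function name f_naive is just for illustration.)

```python
import numpy as np

def f_naive(x, A, b):
    """Logistic loss, naive implementation of the formula above."""
    z = np.dot(A, x)
    tmp = 1. / (1. + np.exp(-z))
    return np.mean(-b * np.log(tmp) - (1 - b) * np.log(1 - tmp))
```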
It turns out that this simple and innocent-looking implementation overflows or returns NaN for moderately large values of the input.
This can be a real problem, for example, when training a machine learning model that uses this loss function.
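For instance (the values below are purely illustrative; a single sample with $b_i = 1$ and $\aa_i^T \xx$ equal to 40 or -800 is enough to trigger the problem):

```python
import numpy as np

x, b = np.array([1.]), np.array([1.])
print(f_naive(x, np.array([[40.]]), b))    # nan: (1 - b) * log(1 - s(40)) becomes 0 * (-inf)
print(f_naive(x, np.array([[-800.]]), b))  # inf, with a RuntimeWarning: overflow in exp
```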
This happens for two reasons. First, the two terms in 1 + np.exp(-z) can be on very different scales: np.exp(-40), for example, is of the order of $10^{-18}$. Since double precision holds at most 17 significant decimal digits, the expression 1 + np.exp(-40) gets truncated to 1, a phenomenon known as round-off error in numerical analysis. The logarithm of $1 - s(z)$ is then evaluated at exactly zero and returns -inf, which propagates to the final result as inf or NaN.
Second, the exponential function will overflow for large values of the input. For double precision types (float64 in NumPy jargon), $e^t$ overflows for values $t \geq 710$, while for single precision types (float32) this happens for $t \geq 89$.
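Both effects are easy to reproduce in isolation:

```python
import numpy as np

# Round-off: exp(-40) ~ 4.2e-18 is below the resolution of a double around 1.0,
# so adding it to 1 leaves 1 unchanged.
print(1. + np.exp(-40.) == 1.)    # True

# Overflow of the exponential in double and in single precision.
print(np.exp(710.))               # inf, with a RuntimeWarning: overflow
print(np.exp(np.float32(89.)))    # inf in float32
```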
A closer look at the log-sigmoid
Log-sigmoid function in the interval [-5, 5].
The issues encountered before happen during the evaluation of the log-sigmoid function $\log(s(\cdot))$. We therefore take a closer look at this function and examine its accuracy. We will compare the following three implementations:
1. naive. Directly evaluating $\log(s(t))$ as in the previous implementation. In code, this corresponds to the first routine in the sketch after this list.
2. logsumexp. SciPy's logsumexp provides a stable computation of the expression \begin{equation} \varphi(\bb) = \log\left(\sum^{d}_{i=1} \exp(b_i)\right)~. \end{equation} We can reuse this function to compute the log-sigmoid through the identity \begin{align} \log(s(t)) &= \log(1/(1 + \exp(-t)))\\ &= -\log(1 + \exp(-t))\\ &= - \varphi([0, -t])~. \end{align} This corresponds to the second routine in the sketch after this list.
3. log1pexp. Martin Mächler wrote a note in 2012 on how to accurately compute the log-sigmoid function in the R language. The third routine in the sketch below is a Python translation of his recipe.
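Here is a minimal sketch of these three routines for scalar inputs (the function names are illustrative; the cutoffs in the last one follow Mächler's note):

```python
import numpy as np
from scipy.special import logsumexp

def logsig_naive(t):
    # 1. direct evaluation of log(s(t))
    return np.log(1. / (1. + np.exp(-t)))

def logsig_logsumexp(t):
    # 2. log(s(t)) = -log(exp(0) + exp(-t)), through SciPy's logsumexp
    return -logsumexp(np.array([0., -t]))

def logsig_log1pexp(t):
    # 3. log(s(t)) = -log1p(exp(-t)), with the piecewise approximations
    # recommended by Maechler (2012) to avoid round-off and overflow
    if t < -33.3:
        return t
    elif t <= -18:
        return t - np.exp(t)
    elif t <= 37:
        return -np.log1p(np.exp(-t))
    else:
        return -np.exp(-t)
```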
We can now compare the relative accuracy of these approaches. In the plot below, we show the relative accuracy (lower is better) as a function of the input.
Relative precision for the logistic loss computation (lower is better). While most methods do reasonably well for negative values, the precision quickly deteriorates in the positive regime. Values above 1 indicate that the routine gave NaN.
A more stable implementation of the logistic loss
With the results from the previous section we can now write a more stable version of the logistic loss.
One last trick is to use the identity $\log(1 - s(z)) = -z + \log(s(z))$ to simplify \eqref{eq:logloss} slightly, so that it becomes \begin{equation}\label{eq:logistic2} f(\xx) = \frac{1}{n}\sum_{i=1}^n (1 - b_i) \aa_i^T \xx - \log(s(\aa_i^T \xx))~. \end{equation}
And finally, here is the full Python implementation. The routine logsig is slightly more convoluted than the ones presented before because it evaluates the previous routine component-wise, without Python for loops, for efficiency on large arrays:
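(A sketch of such an implementation; the loss is named f_logistic here for illustration, while logsig follows the naming used in the text.)

```python
import numpy as np

def logsig(t):
    """Compute log(s(t)) component-wise, choosing a stable formula for each regime."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    idx0 = t < -33.3
    out[idx0] = t[idx0]
    idx1 = (t >= -33.3) & (t < -18)
    out[idx1] = t[idx1] - np.exp(t[idx1])
    idx2 = (t >= -18) & (t < 37)
    out[idx2] = -np.log1p(np.exp(-t[idx2]))
    idx3 = t >= 37
    out[idx3] = -np.exp(-t[idx3])
    return out

def f_logistic(x, A, b):
    """Logistic loss, stable implementation of the simplified formula above."""
    z = np.dot(A, x)
    return np.mean((1 - b) * z - logsig(z))
```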
In the same example as before, this implementation does not overflow anymore:
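(Same illustrative values as before.)

```python
x, b = np.array([1.]), np.array([1.])
print(f_logistic(x, np.array([[40.]]), b))     # ~4.25e-18, tiny but finite
print(f_logistic(x, np.array([[-800.]]), b))   # 800.0, no overflow or NaN
```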
Gradient
For optimization, an accurate evaluation of the gradient is crucial for obtaining a solution with high accuracy. In this section I'll compare different approaches to compute this gradient.
Computation of the gradient of the logistic loss simplifies considerably thanks to the following identity verified by the sigmoid function: \begin{equation}\label{eq:diffeq_sigma} \frac{d}{dt}s(t) = s(t)(1 - s(t))~. \end{equation} In particular, this implies that \begin{equation} \label{eq:dif_log} \frac{d}{dt}\log(s(t)) = \frac{\frac{d}{dt}s(t)}{s(t)} = 1 - s(t) \end{equation}
Using these identities and \eqref{eq:logistic2}, we can write the gradient of the logistic loss as \begin{equation} \label{eq:gradient} \nabla f(\xx) =~ \frac{1}{n}\sum_{i=1}^n\aa_i (s(\aa_i^T \xx) - b_i) \end{equation}
As we will see, the computation of $s(\cdot) - b_i$ also raises some numerical issues. We will now compare three different implementations of $g(t) = s(t) - b$:
1. naive. The previous equation is easy to implement using SciPy's sigmoid function (special.expit):
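(The function name expitb_naive is illustrative.)

```python
from scipy import special

def expitb_naive(t, b):
    # s(t) - b, implemented directly with SciPy's sigmoid
    return special.expit(t) - b
```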
Unfortunately, the subtraction of $b_i$ can cause a catastrophic loss of precision when $b_i$ is close to 1: in the example below this routine returns exactly 0, while the correct answer is $\approx -4.24 \times 10^{-18}$.
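A minimal reproduction, assuming (as the quoted value $\approx -e^{-40}$ suggests) an example with $t = 40$ and $b = 1$:

```python
print(expitb_naive(40., 1.))   # 0.0: expit(40) rounds to exactly 1.0
# the exact value of s(40) - 1 is -exp(-40) / (1 + exp(-40)), about -4.24e-18
```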
2. expitb_v2. We can avoid the precision loss of the previous example by rewriting the problematic expression as a single fraction: \begin{align} s(t) - b = \frac{1}{1 + e^{-t}} - b &= \frac{1 - b - b e^{-t}}{1 + e^{-t}}\\ &= \frac{(1 - b)e^t - b}{e^t + 1}~.\label{eq:expit_b} \end{align} In this case, when $b=1$ the expression becomes $-1 / (e^t + 1)$, which no longer suffers from this cancellation.
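(A sketch of this variant; the name expitb_v2 follows the list item above.)

```python
import numpy as np

def expitb_v2(t, b):
    # s(t) - b written as a single fraction, ((1 - b) exp(t) - b) / (exp(t) + 1)
    exp_t = np.exp(t)
    return ((1 - b) * exp_t - b) / (exp_t + 1)
```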
As advertised, this reformulation gives a much more accurate answer in the previous use case.
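For instance, with the same illustrative values as before:

```python
print(expitb_v2(40., 1.))   # about -4.24e-18, matching the exact value
```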
While better than the naive approach, this routine still suffers from some issues: the exponential $e^t$ will overflow for large values of $t$ (around $t \approx 710$ in double precision), at which point the routine returns NaN.
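For example (any $t$ past the overflow threshold together with $b \neq 1$ triggers it):

```python
print(expitb_v2(800., 0.5))   # nan: exp(800) overflows to inf, giving inf / inf
```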
3. expitb_sign. We can fix this last problem by making sure the exponential is never evaluated at a large positive argument. To do so, we multiply the numerator and denominator of \eqref{eq:expit_b} by $e^{-t}$ to arrive at the equivalent expression $((1 - b) - b e^{-t}) / (1 + e^{-t})$, which does not involve $e^t$. Finally, we choose this formula whenever $t > 0$ and the previous one otherwise. This gives the following code:
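(One way to write this routine, operating component-wise on NumPy arrays of the same shape; the vectorization details are illustrative.)

```python
import numpy as np

def expitb_sign(t, b):
    """Compute s(t) - b component-wise, never evaluating exp at a large positive argument."""
    t = np.asarray(t, dtype=float)
    b = np.asarray(b, dtype=float)
    out = np.zeros_like(t)
    idx = t <= 0
    # t <= 0: exp(t) <= 1 is safe, use ((1 - b) exp(t) - b) / (exp(t) + 1)
    exp_t = np.exp(t[idx])
    out[idx] = ((1 - b[idx]) * exp_t - b[idx]) / (exp_t + 1)
    # t > 0: exp(-t) < 1 is safe, use ((1 - b) - b exp(-t)) / (1 + exp(-t))
    exp_nt = np.exp(-t[~idx])
    out[~idx] = ((1 - b[~idx]) - b[~idx] * exp_nt) / (1 + exp_nt)
    return out
```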
We can now compare the relative accuracy of these approaches. In the plot below, we show the relative accuracy (lower is better) as a function of the input.
Relative precision for different implementations of the gradient of the logistic loss (lower is better). The naive method quickly suffers from a loss of relative precision in the positive regime. expitb_v2 exhibits better accuracy but outputs NaN for large values of the input (values above 1 indicate NaN). expitb_sign has none of these issues and achieves the best overall accuracy.
Implementation of the gradient
We can now use this to compute the full gradient:
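(A sketch reusing the expitb_sign routine above; the name f_grad is illustrative.)

```python
import numpy as np

def f_grad(x, A, b):
    """Gradient of the logistic loss, A^T (s(A x) - b) / n."""
    z = np.dot(A, x)
    return np.dot(A.T, expitb_sign(z, b)) / A.shape[0]
```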