Entropy as a measure of surprise

I’ve been musing for a while about the role of entropy in machine learning and, more generally, in statistics, particularly as a way to describe probability distributions. More recently, I stumbled upon this blog post on interpretations of the Kullback-Leibler divergence K(P || Q), which really triggered my curiosity. This post is a way for me to shape my thoughts on the topic and eventually achieve a better understanding of how to use entropy-related quantities as metrics for model evaluation and selection.

Context

Before I throw a lot of math your way, I figured I would try to paint a picture as to when any of this is relevant. Suppose you are a data scientist for a restaurant chain specializing in homemade Italian pasta, and your task is to build a system to assign service labels (Delivery, Collect) to orders. Of course you have plenty of historical data. If you follow the data science checklist, you will probably start with a simple logistic regression model on your sales data, based on the content of the customer’s basket. To know whether your model is good or not, you’ll probably use metrics like accuracy, precision and recall on some holdout set.

If your goal is to build some real-time prediction model to help allocate delivery drivers, then single-order correctness is probably what you are most interested in, and these metrics will do the job well. But what if instead you want this model to be used for simulation and forecasting? For example, you want to optimise the total number of delivery drivers you need to hire, and to do that you will simulate the next X months of orders. In that case, the allocation of a single order may not be as important as the overall distributions of your orders’ attributes per service. Over large batches, the distribution of weight and price of your orders will matter more than whether each order is perfectly classified or not. You will then want a way to compare distributions between your historical and simulated data. Achieving this is exactly what I want to talk about here.

KL divergence, PSI and entropy

There are two common ways to compare two probability distributions P and Q: the KL divergence K(P || Q) and the population stability index (PSI) \mathrm{PSI}(P, Q). At first glance, they look quite similar

\begin{aligned} K(P || Q) &= \sum_x P(x) \log \frac{P(x)}{Q(x)}, \\ \mathrm{PSI}(P, Q) &= \sum_x (P(x) - Q(x)) \log \frac{P(x)}{Q(x)}, \end{aligned}

only differing in the factor in front of the logarithm. But this difference is crucial: the PSI is symmetric between both distributions, while the KL divergence treats P in a special manner, often interpreted as a “true” distribution. To make sense of both of these concepts, I need to bring a more fundamental quantity into the picture: the entropy of a distribution P, defined as

H(P) = -\sum_x P(x) \log P(x).

The entropy gets its name from the eponymous physics quantity used in thermodynamics and statistical physics. There are deep connections between the two which I won’t get into here. From a statistics perspective, all you need to know is that entropy is a measure of the uncertainty carried by a distribution. As a simple example, the entropy is maximized for a uniform distribution U:¹ the uniform distribution assigns equal weight to every event, thereby carrying no information whatsoever about which event is more likely to happen, and therefore has the highest uncertainty.
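As a quick sanity check, here is a small sketch (plain Python, natural logarithm, made-up probabilities) comparing the entropy of a uniform distribution with that of a sharply peaked one:

```python
import math

def entropy(p):
    """Shannon entropy H(P) = -sum_x P(x) log P(x), using the natural log."""
    return -sum(px * math.log(px) for px in p if px > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximal uncertainty over 4 outcomes
peaked = [0.97, 0.01, 0.01, 0.01]    # outcome is nearly certain

print(entropy(uniform))  # log(4) ≈ 1.386, the maximum for 4 outcomes
print(entropy(peaked))   # ≈ 0.168, much lower
```

The `if px > 0` guard matters: events with zero probability contribute nothing to the entropy (the limit of x log x as x → 0 is 0), but `math.log(0)` would raise an error.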

One thing I learnt from the blog post I linked earlier was that the -\log P portion of the entropy is also called the surprise of an event under a distribution. Defining I_P(x) \equiv -\log P(x), the surprise I is zero for certain events and increases as the probability decreases. Of course! The less likely an event, the more surprised we are to see it happen. The reason to choose the logarithm is that it also satisfies the property that the surprise of two independent events happening together is the sum of their individual surprises, so I_P(x, y) = I_P(x) + I_P(y).² Using this formalism, the entropy can be rewritten as the expected surprise of a random variable X distributed according to P, so H(P) = \mathbb{E}_P[I_P(X)].

With this in mind, the KL divergence and the PSI can be rewritten as

\begin{aligned} K(P || Q) &= \mathbb E_P[I_Q(X)] - \mathbb E_P[I_P(X)], \\ \mathrm{PSI}(P, Q) &= K(P || Q) + K(Q || P). \end{aligned}
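A quick numerical check that the surprise-based formulation matches the direct sums from the first section (plain Python; the two distributions here are arbitrary, just for illustration):

```python
import math

def surprise(p, x):
    """I_P(x) = -log P(x)."""
    return -math.log(p[x])

def kl(p, q):
    # K(P || Q) = E_P[I_Q(X)] - E_P[I_P(X)]: the average *excess* surprise
    # when using Q instead of P.
    return sum(p[x] * (surprise(q, x) - surprise(p, x)) for x in p)

def psi(p, q):
    # PSI(P, Q) = K(P || Q) + K(Q || P): the symmetrized version.
    return kl(p, q) + kl(q, p)

p = {"a": 0.6, "b": 0.3, "c": 0.1}
q = {"a": 0.4, "b": 0.4, "c": 0.2}

# Direct formulas: sum_x P log(P/Q) and sum_x (P - Q) log(P/Q).
direct_kl = sum(p[x] * math.log(p[x] / q[x]) for x in p)
direct_psi = sum((p[x] - q[x]) * math.log(p[x] / q[x]) for x in p)

assert abs(kl(p, q) - direct_kl) < 1e-12
assert abs(psi(p, q) - direct_psi) < 1e-12
```

Note that `psi(p, q) == psi(q, p)` by construction, while `kl(p, q)` and `kl(q, p)` generally differ.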

In these formulae, you can recognise the entropies of both distributions P and Q, but also some cross-terms involving both distributions. These are the natural generalisation of the entropy to a cross-entropy between two distributions, -\sum_x P(x) \log Q(x), which is just the average surprise when observing the random variable X \sim P while assuming X \sim Q. This contribution is dominated by events for which Q is small but P \simeq 1: this means your model is confidently wrong, estimating these likely events to be unlikely.

To summarize, the KL divergence is the average excess surprise when using Q instead of P as an estimate of the true distribution, while the PSI is essentially the symmetric version of the KL divergence. One way to think about the PSI is to notice that, unlike the KL divergence, it is sensitive to both cases where Q \ll 1, P \simeq 1 as well as Q \simeq 1, P \ll 1, whereas the KL divergence is biased towards the former. So if you use the KL divergence to measure the distance between distributions, don’t forget to account for the fact that unlikely events being misallocated will not be penalized as harshly. In a lot of cases, this is actually a good thing!

Let’s illustrate this with our takeaway example! Suppose your orders can fall into three buckets: small (one pasta dish, one drink), medium (a meal for two), large (the nearest party is ordering enough pasta to feed 10 people). Let’s focus on the size distribution of delivery orders. Let’s assume you now have two modelled distributions alongside the historical one (one from a logistic regression, the other from an SVM):

| Size | P (historical) | Q (logistic regression) | Q′ (SVM) |
| ---- | -------------- | ----------------------- | -------- |
| S    | 0.05           | 0.01                    | 0.05     |
| M    | 0.85           | 0.95                    | 0.20     |
| L    | 0.10           | 0.04                    | 0.75     |

Clearly, both models are inaccurate on this dimension. The logistic regression overestimates medium orders, while the SVM model underestimates them by a lot. From this data, I expect the KL divergence of Q^\prime to be much worse than that of Q, as it grossly underestimates the most likely event, and this is confirmed by calculating them: K(P || Q) \simeq 0.08 while K(P || Q^\prime) \simeq 1.03. In this example, the PSI agrees with this ranking (because Q^\prime is just much worse than Q) and even widens the gap, with \mathrm{PSI}(P, Q) \simeq 0.13 and \mathrm{PSI}(P, Q^\prime) \simeq 2.25.
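These figures are easy to reproduce; a minimal sketch in plain Python using the bucket probabilities from the table:

```python
import math

def kl(p, q):
    """K(P || Q) = sum_x P(x) log(P(x)/Q(x)), natural log."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def psi(p, q):
    """PSI(P, Q) = sum_x (P(x) - Q(x)) log(P(x)/Q(x))."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

p       = [0.05, 0.85, 0.10]  # historical (S, M, L)
q_logit = [0.01, 0.95, 0.04]  # logistic regression
q_svm   = [0.05, 0.20, 0.75]  # SVM

print(round(kl(p, q_logit), 2), round(kl(p, q_svm), 2))    # 0.08 1.03
print(round(psi(p, q_logit), 2), round(psi(p, q_svm), 2))  # 0.13 2.25
```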

This actually makes a lot of sense. I mentioned that the KL divergence is very sensitive to underestimation of likely events, but conversely it is a bit blind to overestimation of unlikely events, as these weigh a lot less by definition. The PSI is a lot more equitable there. So the PSI for Q^\prime is also penalizing the overestimation of large orders.

Distribution measure with a scale

As I just showed, you can now use either the KL divergence or the PSI to estimate how far your modelled distributions are from the historical ones. The only issue is: what does “far” mean? Is K(P || Q) = 0.5 good? To be honest, there is no really good answer to this question, at least conceptually. The main reason for this is that neither of the two measures has a notion of scale. Worse than that, the definition of surprise is only really constrained up to a proportionality constant. So you could rescale every contribution and still keep a rather consistent definition of your measures.

If there is no intrinsic scale, then all you have to do is compare two distributions to PP. In the previous example, we had two models, so it was easy to just compare them to each other. If you don’t have that luxury, you can just build a reference value yourself.

For example, let’s focus on the KL divergence. If you know the value of the measure w.r.t. some other distribution which you know is good or bad, then you can just compare your value to that one. But if you look closely, when you take the difference between two KL divergences, the entropy terms for PP cancel out

K(P || Q) - K(P || Q^\prime) = \mathbb E_P[I_Q] - \mathbb E_P[I_{Q^\prime}].

And this is where things get interesting: there are plenty of distributions you can use as a reference. Let’s look at the average surprise \mathbb E_P[I_{Q_{\mathrm{ref}}}] for a few of them:

  • the first one to come to mind is the uniform distribution U, for which \mathbb E_P[I_U] = \log N (where N is the number of values X can take),
  • if you know the mean \mu of P, you can model it via an exponential distribution \mathrm{Exp}_\lambda(x) = \lambda e^{-\lambda x} with \lambda = \frac{1}{\mu}. The average surprise for this distribution is \mathbb E_P[I_{\mathrm{Exp}}] = \log(\mu) + 1.
  • if you know both the mean \mu and variance \sigma^2, you can model your distribution via a Gaussian distribution \mathcal N_{\mu, \sigma^2}(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{(x - \mu)^2}{2 \sigma^2}}, and the average surprise is \mathbb E_P[I_{\mathcal N}] = \frac{1 + \log(2 \pi \sigma^2)}{2}.
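Both closed forms are easy to verify numerically. The sketch below uses a small hypothetical sample standing in for draws from P, fits the reference parameters to the sample’s own moments, and checks that the average surprise lands exactly on \log(\mu) + 1 and \frac{1 + \log(2\pi\sigma^2)}{2} respectively:

```python
import math

sample = [0.5, 1.2, 2.3, 0.8, 1.7, 3.1, 0.9, 1.4]  # hypothetical order sizes
n = len(sample)
mu = sum(sample) / n
var = sum((x - mu) ** 2 for x in sample) / n

# Exponential reference with lambda = 1/mu:
# I_Exp(x) = -log(lambda * exp(-lambda * x)) = log(mu) + x/mu,
# so averaging over a sample with mean mu gives log(mu) + 1 exactly.
exp_surprise = sum(math.log(mu) + x / mu for x in sample) / n
assert abs(exp_surprise - (math.log(mu) + 1)) < 1e-12

# Gaussian reference fitted to (mu, var):
# I_N(x) = (1/2) log(2*pi*var) + (x - mu)^2 / (2*var),
# and averaging the quadratic term over the sample contributes exactly 1/2.
gauss_surprise = sum(
    0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)
    for x in sample
) / n
assert abs(gauss_surprise - 0.5 * (1 + math.log(2 * math.pi * var))) < 1e-12
```

Note that the identities hold exactly because the reference parameters are fitted to the sample’s own mean and variance; with parameters estimated elsewhere, the average surprise would be strictly larger.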

Something interesting about these quantities: if you want to fit a Gaussian distribution or an exponential distribution, all you have to do is minimize \mathbb E[I_{\mathrm{Exp}}] over \lambda or \mathbb E[I_{\mathcal N}] over \mu, \sigma^2. This is the essence of maximum likelihood estimation and a cornerstone of machine learning.

If you want to compare a distribution Q to a “true” distribution P, taking a uniform or Gaussian distribution as a reference, all you have to do is compute the average surprise for your distribution \mathbb E_P[I_Q] and compare it to the corresponding closed-form expression. For example, if you have a discrete random variable and \mathbb E_P[I_Q] > \log N, you can think of your model Q as being worse, on average, than a uniform distribution that assumes nothing at all about which events are likely.
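Concretely, that comparison is a one-liner (plain Python; the distributions are the historical and SVM buckets from the earlier example, with N = 3):

```python
import math

p = [0.05, 0.85, 0.10]  # "true" distribution over N = 3 buckets
q = [0.05, 0.20, 0.75]  # model distribution (the SVM from the example)

# Average surprise E_P[I_Q] = -sum_x P(x) log Q(x), i.e. the cross-entropy.
avg_surprise = -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Compare against the uniform reference log(N).
print(avg_surprise > math.log(len(p)))  # True: worse than a uniform reference
```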

I hope this was as useful to you as it was to me. None of this is particularly new but what I like about this approach is that it carries a lot more intuition about these quantities and why one might want to minimize or maximize them, instead of just treating them as abstract objectives which “just” work.

Footnotes

  1. This is true for discrete distributions or distributions with compact support only. On \mathbb R, things are a bit more complex.

  2. I’ll admit that this last point may feel a bit arbitrary, and I don’t really have a good way to explain why it should be the case, but somehow the converse feels more wrong. If two events are independent, then why would you be more (or less) surprised by them happening together than by them happening separately? Somehow, having II be additive for independent events seems like a natural property to require in this case.