
A brief history of entropy

I recently posted about entropy in information theory and how it relates to the information-theoretic notion of surprise. That post was based on what is called the Shannon entropy1 (named after Bell Labs engineer Claude Shannon), which is essentially a measure of the expected amount of information carried by a random variable. A widespread anecdote about Shannon entropy is that von Neumann, a famous physicist who also has an entropy named after him, advised Shannon to call it “entropy” because nobody knows what it really is, so “you always have the advantage.”2 As it turns out, Shannon and von Neumann entropies are essentially the same thing in different contexts: the former measures the information expressed by a random variable, while the latter measures the statistical uncertainty of a quantum system. So it was well named after all! In this post, I want to focus on the physics side of entropy, but what I will say about it is also quite relevant to information theory, and more broadly to any statistical field.

Entropy in thermodynamics

Entropy was first coined in the 19th century by Rudolf Clausius in the context of thermodynamics. This was in the middle of the industrial revolution, and scientists were focused on understanding how to convert heat into work more efficiently to build better engines. The way this usually goes is that you have two heat reservoirs (basically two large systems whose temperatures don’t change much even when exchanging heat; e.g., the atmosphere tends to remain at ambient temperature even when you turn on your oven) with temperatures $T_H > T_C$ (hot and cold). By transferring thermal energy from the hot reservoir to the cold one, you can extract mechanical energy! This is heat-work conversion, governed by the equation

$$W = Q_H - Q_C$$

This type of engine is called a heat engine, and it is cyclical: every cycle the engine returns to its initial state (so the total energy of the system doesn’t change) but work has been done. For example, in a steam engine, water is heated and boiled to produce steam, the steam pushes a piston (thereby doing work), and then the steam cools down and condenses back to water. This kind of process is the basis of modern technology! The efficiency of this process is the ratio between the work done and the heat extracted from the hot reservoir, $\eta = \frac{W}{Q_H} = 1 - \frac{Q_C}{Q_H}$.

But what Clausius noticed is that this process is not fully efficient: some of the energy that could theoretically be converted into work is lost as extra heat to the cold reservoir. In an ideal reversible engine, the efficiency is maximal and given by the Carnot efficiency $\eta_C = 1 - \frac{T_C}{T_H}$, but in practice, the efficiency is always lower than that. The constraint $\eta < \eta_C$ can be rewritten as

$$\frac{Q_C}{T_C} - \frac{Q_H}{T_H} > 0$$

By defining the ratio $\frac{Q}{T}$ as the entropy variation $\Delta S$ of the reservoir (hot or cold), this constraint implies that the total entropy of the engine and both reservoirs is always growing.
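
To make this concrete, here is a minimal numerical sketch in Python (the reservoir temperatures and heat flows are made-up numbers, not from the post) checking that an engine running below the Carnot efficiency always increases the total entropy of the reservoirs:

```python
# Toy check: an engine below the Carnot efficiency always increases total entropy.
# Reservoir temperatures and heat flow are made-up numbers for illustration.

T_hot, T_cold = 500.0, 300.0           # reservoir temperatures (K)
Q_hot = 1000.0                         # heat drawn from the hot reservoir per cycle (J)

eta_carnot = 1 - T_cold / T_hot        # maximal (reversible) efficiency: 0.4 here

for eta in (0.2, 0.3, eta_carnot):     # real engines have eta < eta_carnot
    W = eta * Q_hot                    # work extracted per cycle
    Q_cold = Q_hot - W                 # heat dumped into the cold reservoir
    dS = Q_cold / T_cold - Q_hot / T_hot   # total entropy change of the reservoirs
    print(f"eta = {eta:.3f}   W = {W:6.1f} J   dS = {dS:+.4f} J/K")

# dS > 0 whenever eta < eta_carnot, and dS = 0 exactly at the Carnot limit.
```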

This is also the reason why the process is irreversible. You can’t exchange heat from the cold to the hot reservoir while also extracting work without violating the second law of thermodynamics, $\Delta S > 0$. However, if you provide work ($W \to -W$), then you can cool down your cold reservoir while still increasing the system’s total entropy. This is essentially how your fridge works!

If, like me, you’re not a big fan of classical thermodynamics with all its reservoirs and engines, trust me, it gets more interesting from here. Because while thermodynamics is where entropy was first introduced, the fields of statistical physics and quantum mechanics have brought a much deeper understanding of what entropy really is.

Entropy in statistical physics

A few decades after Clausius’s seminal work, the field of statistical physics was kicked off single-handedly by the work of Ludwig Boltzmann. While Einstein and Hawking may be the more famous physicists of recent times (particularly among the general public), Boltzmann stands alongside them as one of the most influential physicists of all time, even though he sadly never got such recognition while he was alive.

Statistical physics describes systems with large numbers of degrees of freedom (for example, the air in your room or the water molecules in your glass) as a statistical ensemble of states, labeled by the individual components’ attributes (positions of the oxygen or water molecules, their velocities, etc.). When the system is at equilibrium, it can be modeled as a random variable over some ensemble of states with a given probability distribution.

In the microcanonical ensemble, equilibrium states have fixed energy $E$, number of elements $N$, and volume $V$, and all such states are equally likely with probability $P = 1/W$, where $W$ is the total number of microstates. Let’s try to gain some intuition for this! Sadly, my trusty ambient air or water glass examples don’t really work here (for reasons that’ll become more apparent later), so let’s just craft one.

Let’s take two six-sided dice. The possible states of the whole system are all the combinations $(n,m)$ with $1 \leq n, m \leq 6$. Now let’s artificially call the quantity $E = n + m$ the energy of the system (clearly, this has nothing to do with the usual notion of energy). In the microcanonical ensemble, each value of $E$ defines a different equilibrium state whose microstates $(n,m)$ are all equiprobable (for $E=8$, these would be $(2,6), (3,5), (4,4), (5,3), (6,2)$). Even if you keep rolling the dice constantly, as long as the total remains fixed, you can consider the system at equilibrium. This is illustrated in the figure below.

Distribution of the possible states of two dice across energy levels (where energy is the sum of the dice values).

Within this context, you can define the entropy as $S = k_B \log W$, where $k_B$ is called Boltzmann’s constant.3 In my opinion, this is the most important definition of entropy. When you hear that entropy is a measure of disorder, this is the definition it refers to. To make that clearer, let’s look at the entropy of each equilibrium for our dice system:

| Energy $E$ | Number of microstates $W$ | Entropy $S/k_B = \log W$ |
|---|---|---|
| 2 | 1 | $0$ |
| 3 | 2 | $\log 2 \simeq 0.69$ |
| 4 | 3 | $\log 3 \simeq 1.10$ |
| 5 | 4 | $\log 4 \simeq 1.39$ |
| 6 | 5 | $\log 5 \simeq 1.61$ |
| 7 | 6 | $\log 6 \simeq 1.79$ |
| 8 | 5 | $\log 5 \simeq 1.61$ |
| 9 | 4 | $\log 4 \simeq 1.39$ |
| 10 | 3 | $\log 3 \simeq 1.10$ |
| 11 | 2 | $\log 2 \simeq 0.69$ |
| 12 | 1 | $0$ |

The equilibria with the highest entropy are those for which the most combinations of dice reach the same energy, while the entropy of the low and high energy states is small because there are fewer microstates available (or even just one).
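
If you want to check this table for yourself, here is a short Python sketch (my own illustration) that enumerates the 36 microstates of the two dice and computes $\log W$ for each energy level:

```python
from collections import Counter
from math import log

# Count the microstates (n, m) of two six-sided dice for each "energy" E = n + m.
microstates = Counter(n + m for n in range(1, 7) for m in range(1, 7))

for E in sorted(microstates):
    W = microstates[E]      # number of microstates at this energy
    print(f"E = {E:2d}   W = {W}   S/k_B = log W = {log(W):.2f}")
```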

What does that have to do with disorder?

Well, imagine a “gas of dice” at equilibrium. In a low-entropy equilibrium, the gas will look pretty much uniform because there are few combinations to choose from. In a high-entropy equilibrium, on the other hand, there are plenty of combinations of dice to choose from, so the gas will look a lot more diverse, which is what we call “disordered”.

A gas of dice at equilibrium. The higher the entropy, the more disordered the gas looks.

This definition of entropy already agrees with the Shannon entropy. After all, in the microcanonical ensemble, all microstates of a given equilibrium are equiprobable, so the probability distribution of the system is the uniform distribution $U(W)$. For such a distribution, the Shannon entropy is $H[U(W)] = \log W$, which is exactly the Boltzmann entropy up to the constant $k_B$.

The reason I needed to use this dice example instead of working with oxygen or water molecules is that in our common experience, most systems at equilibrium are not described by a microcanonical ensemble, but rather by the canonical ensemble. From a microscopic perspective, there is little difference between the two: the individual states are the same! What matters is the probability distribution of these microstates. While in the microcanonical ensemble the distribution was uniform with compact support (over fixed $E$), in the canonical ensemble we use the Boltzmann distribution $P(E) = \frac{e^{-E / k_B T}}{Z(T)}$, where $Z(T) = \sum_E e^{-E / k_B T}$ is a normalization factor called the partition function and $T$ is the temperature of the system (assumed fixed in this ensemble).
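
As a quick sketch of how this works in practice (reusing the toy dice energies, with $k_B = 1$ and an arbitrary temperature of my choosing), you can compute the partition function and the Boltzmann probabilities like this:

```python
from math import exp

# Toy canonical ensemble built on the dice example, with k_B = 1 and an
# arbitrary temperature T; both choices are purely illustrative.
energies = [n + m for n in range(1, 7) for m in range(1, 7)]   # one entry per microstate
T = 2.0

# Partition function: sum of Boltzmann weights over every microstate.
Z = sum(exp(-E / T) for E in energies)

# Probability of a single microstate with energy E: lower energies are favoured,
# although high-degeneracy energies are still observed often overall.
for E in sorted(set(energies)):
    p_micro = exp(-E / T) / Z
    print(f"E = {E:2d}   P(microstate) = {p_micro:.4f}")

print("normalization check:", round(sum(exp(-E / T) / Z for E in energies), 6))  # -> 1.0
```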

Now at this point, you must have noticed that I haven’t really mentioned temperature since we left the industrial revolution. A reasonable definition of temperature, based on our microcanonical ensemble and inspired by what we know from classical thermodynamics, is as the inverse of the derivative of entropy with respect to energy: $\frac{1}{T} = \frac{\partial S}{\partial E}$.

From the canonical probability distribution $P$, you can get an expression for the entropy of the system as $S = -k_B \sum_E P(E) \log P(E)$ which, once again, is just the Shannon entropy of the Boltzmann distribution! This expression simplifies greatly to

$$\begin{aligned}
S & = - k_B \sum_E \frac{e^{-E / k_B T}}{Z} \left( \frac{-E}{k_B T} - \log Z \right)\\
& = k_B \log Z + \frac{k_B T}{Z} \frac{\partial Z}{\partial T} \\
& = -\frac{\partial F}{\partial T}~,
\end{aligned}$$

where $F = -k_B T \log Z$ is called the Helmholtz free energy of the system. You can think of $F$ as the total amount of energy which can be extracted as work from a closed system at fixed temperature.4 After all this work, we can retroactively justify Clausius’s definition from first principles

$$\begin{aligned}
\Delta S & = \int \mathrm{d} S\\
& = \int \frac{\partial S}{\partial E} \, \mathrm{d} E\\
& = \frac{1}{T} \int \mathrm{d} E = \frac{Q}{T}~,
\end{aligned}$$

thus recovering the relationship between entropy, temperature and heat.5
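
If you want to convince yourself of the identity $S = -\frac{\partial F}{\partial T}$ without redoing the algebra, here is a small numerical check (with $k_B = 1$ and a made-up set of energy levels) comparing the Shannon-style entropy of the Boltzmann distribution with a finite-difference estimate of $-\frac{\partial F}{\partial T}$:

```python
from math import exp, log

# Check S = -sum_E P(E) log P(E) against S = -dF/dT with F = -T log Z.
# k_B = 1 and the energy levels are made up for illustration.
levels = [0.0, 1.0, 2.0, 5.0]

def partition(T):
    return sum(exp(-E / T) for E in levels)

def gibbs_entropy(T):
    Z = partition(T)
    probs = [exp(-E / T) / Z for E in levels]
    return -sum(p * log(p) for p in probs)

def free_energy(T):
    return -T * log(partition(T))

T, h = 1.5, 1e-5
S_direct = gibbs_entropy(T)
S_from_F = -(free_energy(T + h) - free_energy(T - h)) / (2 * h)   # -dF/dT
print(S_direct, S_from_F)   # the two values agree to numerical precision
```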

As you can see, statistical physics essentially grounds the intuitive work from classical thermodynamics, which was born out of experiments and engineering observations, into a more rigorous statistical framework. What I really like about this is that a lot of the math behind thermodynamics goes from “magic trick” to “oh I’m just calculating a moment of a distribution.”

Entropy in quantum mechanics

We’re entering the last stretch of this little tour. If you haven’t already, I really suggest you read this introduction to quantum mechanics I wrote a while back. As a brief reminder, recall that quantum states are different from the classical states I mentioned in the previous section. They are described by vectors inside a complex vector space called a Hilbert space. Within this formalism, linear operations on states represent transformations and actions like evolution in time, interactions between states, etc. The basis of quantum mechanics is essentially linear algebra applied to physics.

To introduce entropy in quantum mechanics, I first need to introduce a new quantity, beyond the state vector formulation I discussed in the introduction, which encodes the entire probability distribution of the state of a system. This quantity is called the density matrix $\rho$ which, for a given state, is defined by $\rho = \sum_i p_i |\psi_i\rangle \langle \psi_i|$, where $p_i$ is the probability of the system being in the basis state $|\psi_i\rangle$.6

First things first, the density matrix is an operator, so it acts on states and returns another state: $\rho |\psi\rangle = |\psi'\rangle$. By definition, the new state $|\psi'\rangle$ is a linear combination of the possible states $|\psi_i\rangle$, each weighted by the probability $p_i$ as well as the overlap $\langle \psi_i | \psi \rangle$ with the original state. The nice thing about the density matrix is that it allows you to express the expectation value of any observable $\mathcal{O}$ as a trace: $\langle \mathcal{O} \rangle = \mathrm{Tr}(\rho \mathcal{O})$. This is exactly like taking the expectation value of a random variable $X$ with probability distribution $P$ over an infinite space like $\mathbb{R}$.

By now, I hope you see where this is going. If you have a probability distribution and a way to express an expectation value, you can define an entropy, $S = -k_B \mathrm{Tr}(\rho \log \rho)$, which is called the von Neumann entropy. This is exactly the quantum analogue of the Shannon entropy. Better yet, this expression is invariant under basis changes! So if you change to the basis in which $\rho$ is diagonal, you get $S = -k_B \sum_i q_i \log q_i$, which is the Shannon entropy of the probability distribution defined by the eigenvalues $q_i$ of $\rho$.

Of course, like in thermodynamics, this entropy provides a measure of the uncertainty and disorder of the system. The simplest way to see this is through the eigenvalues. The Shannon entropy of the distribution defined by the $q_i$ is high when the distribution is close to uniform, and low when the distribution is peaked around a few values. Back at the level of the density matrix itself, the eigenvalues encode how “dimensional” the operator is. If $\rho$ has only one non-zero eigenvalue, then the operator is essentially a projector onto a single pure state. For a few peaks, the projection is onto a low-dimensional subspace. But for a uniform distribution, the operator is “full rank” and the state is very mixed.
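
As a sketch of how this looks numerically (using NumPy, with states and probabilities made up for illustration), you can build a density matrix, diagonalize it, and compute the von Neumann entropy from its eigenvalues, comparing a pure state, a mixed state, and the maximally mixed state:

```python
import numpy as np

# Von Neumann entropy S = -Tr(rho log rho), in units where k_B = 1.
# The states and probabilities below are made up for illustration.

def von_neumann_entropy(rho):
    # Work with the eigenvalues; drop the (numerically) zero ones since x log x -> 0.
    q = np.linalg.eigvalsh(rho)
    q = q[q > 1e-12]
    return float(-np.sum(q * np.log(q)))

# A pure state |psi><psi| ...
psi = np.array([1.0, 1.0]) / np.sqrt(2)
rho_pure = np.outer(psi, psi.conj())

# ... and a mixed state: 70% |0>, 30% |1>.
rho_mixed = np.diag([0.7, 0.3])

print(von_neumann_entropy(rho_pure))     # ~0: a pure state carries no uncertainty
print(von_neumann_entropy(rho_mixed))    # ~0.61: the Shannon entropy of (0.7, 0.3)
print(von_neumann_entropy(np.eye(2)/2))  # log 2 ~ 0.69: the maximally mixed state
```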

Bonus: entropy in black holes

I hesitated a bit about whether to include this section or not. Black hole entropy is a very cool topic, but it requires a lot of background to be properly explained. But I’d be remiss if, on a tour of entropy in physics, I didn’t at least mention one of the most fascinating appearances of the concept.

You will have to trust me on a lot of the details here, but the starting point of this entire story is that black holes, these extreme astrophysical objects with such strong gravity that not even light can escape,7 behave a lot like simple thermodynamic systems. This was most famously noticed by Jacob Bekenstein and Stephen Hawking in the ’70s.

This is the part where I will be very hand-wavy. In Einstein’s theory of general relativity, an observer who moves through space with a constant acceleration $a$ (like in a rocket) will be able to measure thermal radiation consistent with a perfect heated black body at a temperature $T = \frac{\hbar a}{2\pi k_B c}$, where $\hbar$ is the reduced Planck constant and $c$ is the speed of light. This is called the Unruh effect. It is a consequence of mixing quantum mechanics and relativity,8 and particularly of mixing quantum mechanics with the notion of horizons.

When you are in a rocket, you can’t see all of spacetime. Since you are accelerating, there are regions from which signals would never reach you (as that would require light to “catch up”, but it can’t accelerate). The boundary of these inaccessible regions is called a horizon, and it is the exact same phenomenon that happens around a black hole. So there as well, quantum effects near the horizon cause the black hole to emit thermal radiation, called Hawking radiation, with a temperature set by the local acceleration of the gravitational field at the horizon:9 $T = \frac{\hbar \kappa}{2 \pi k_B c}$, where $\kappa = \frac{c^4}{4 G M}$ is the surface gravity, which depends on the mass $M$ of the black hole.

So a black hole emits thermal radiation. After our long journey through thermodynamics and statistical physics, you can surely imagine that if a system has a temperature and an energy, then it has an entropy. Thanks to Einstein’s most famous equation, we can estimate the black hole’s energy to be $Mc^2$. Using the simple thermodynamic relation $\frac{1}{T} = \frac{\partial S}{\partial E}$ we derived earlier (how handy!), we can integrate10 to get the entropy

$$S = \int \frac{\mathrm{d}E}{T} = \frac{8 \pi G k_B c^3}{\hbar c^4} \int_0^M M^\prime \, \mathrm{d} M^\prime = \frac{4 \pi G k_B}{\hbar c} M^2~.$$

Let’s take a little break and look at what we got here. $T \propto 1/M$, so the more massive a black hole is, the lower its temperature is! But since $S \propto M^2$, the more massive a black hole is, the larger its entropy. Returning to the usual “order/disorder” interpretation of entropy, this just tells you that whatever microstates describe a black hole, their number must grow with its mass. What is fascinating is that no one knows what those microstates look like!

As a last hurrah, let me throw a little equality at you. The radius of the black hole is given by $r = \frac{2 G M}{c^2}$. Assuming the black hole is perfectly spherical, its area is $A = 4 \pi r^2 = \frac{16 \pi G^2 M^2}{c^4}$. Why is this interesting? Because the entropy of the black hole can then be written $S = \frac{k_B c^3}{4 G \hbar} A$.
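
As a final sketch (using SciPy’s physical constants and a solar-mass black hole purely as an example), you can plug numbers into these formulas and check that the mass and area expressions for the entropy agree:

```python
from scipy.constants import G, hbar, c, k as k_B, pi

M = 1.989e30          # one solar mass in kg, used here purely as an example

T = hbar * c**3 / (8 * pi * G * M * k_B)       # Hawking temperature
S_mass = 4 * pi * G * k_B * M**2 / (hbar * c)  # entropy from the integral over dE/T

r = 2 * G * M / c**2                           # Schwarzschild radius
A = 4 * pi * r**2                              # horizon area
S_area = k_B * c**3 * A / (4 * G * hbar)       # Bekenstein-Hawking area law

print(f"T ~ {T:.2e} K")                        # tens of nanokelvin: colder than the CMB
print(f"S (from mass) ~ {S_mass:.3e} J/K")
print(f"S (from area) ~ {S_area:.3e} J/K")     # identical to the mass expression
```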

I know, I know, this has been a long post, and at this point, why do we care that $S \propto A$? In all of the classical and quantum physics I have shown you, there has been one consistent property of entropy: it is an extensive quantity, i.e., it grows with the volume of the system. But black holes are special, and their entropy grows with their surface area instead. This has been interpreted as a signal that all of a black hole’s microstates live on its boundary and not in its interior. This idea is often called the holographic principle, and it is the foundation of my doctoral thesis and four years of my life.

Conclusion

If you’re actually reading this, you made it! This has been a longer post than I thought it would be. I just started writing and didn’t stop until I felt I had done justice to a concept as cool as entropy. If you’re a wandering expert stumbling on this, forgive the inaccuracies and shortcuts, which I felt were necessary to keep this manageable. If you’re just a curious reader who’s never really understood what the hell entropy is all about, I hope this has sated your curiosity, and even better, I hope it has opened up a new interest for you.

Footnotes

  1. Shannon, C. E. (1948). “A Mathematical Theory of Communication.” The Bell System Technical Journal, 27(3), 379–423.

  2. Tribus, M., & McIrvine, E. C. (1971). “Energy and Information.” Scientific American, 224.

  3. You don’t have to think too hard about this constant. As a rule of thumb, think of it as a conversion factor between the units of energy and temperature, so that $k_B T$ is an energy.

  4. There are actually other free energies in physics, such as the grand potential, which is defined for the grand canonical ensemble. The main difference between ensembles is usually which variables you assume fixed: the grand canonical ensemble assumes a fixed temperature and something called a chemical potential $\mu$, while the canonical ensemble assumes a fixed temperature and number of particles. To illustrate this, for a gas in a sealed box, you’d use the canonical ensemble since all gas molecules are trapped in the box. But if you take an open system, like a hot cup of coffee, then the grand canonical ensemble is more appropriate since the coffee can exchange heat and particles with the environment. Some water molecules will evaporate from the cup, some will condense back into it, at rates controlled by the chemical potential $\mu$.

  5. This derivation only truly works in the limit of a system with many particles, called the thermodynamic limit. This is because it relies on the assumption that $S$ is a continuous function of $E$, but if the system is small, $S(E) = k_B \log W(E)$ is too discrete for this to be valid.

  6. For simplicity, I will assume that every basis state is normalized, $\langle \psi_i | \psi_i \rangle = 1$, and that the basis states are orthogonal, $\langle \psi_i | \psi_j \rangle = 0$ for $i \neq j$.

  7. This is a well-known fact, but let me clarify here: light cannot escape from within the event horizon of the black hole (a region often much larger than the black hole itself) because the black hole’s gravitational field is so strong that it bends spacetime in a way that light, always traveling “forward” in a certain sense, stays trapped inside. It’s not a question of pulling light back; rather, the trajectories of light rays are always bent towards the interior of the black hole.

  8. If you’ve ever heard that quantum mechanics and relativity cannot be used together, this is not true. What current physicists are looking for is a quantum explanation of gravity itself. But quantum particles in a classical gravitational field are well-understood.

  9. This is a scrabble-winning formula when it comes to physics constants: $G$ is the gravitational constant, $k_B$ is Boltzmann’s constant, $\hbar$ is the reduced Planck constant, and $c$ is the speed of light. This is why physicists often work in “natural units”: we essentially pretend that $\hbar = c = G = k_B = 1$ to make formulas easier. If we ever need to convert to an actual number, we just need to remember what the dimension of the quantity is (is it a velocity? mass? etc.) and then multiply it by the appropriate ratio of constants to get the right units ($c$ for a velocity, $\sqrt{\frac{\hbar c}{G}}$ for a mass, etc.)

  10. In the derivation, I made the subtle assumption that $S(M \to 0) \to 0$. This is quite consistent with the idea that entropy should vanish when there are no states left (well, usually there is exactly one state, so that $\log W = 0$, but here it’s a continuous approximation).