Random weighted sampling in a hurry
I recently encountered the following problem: I have some rows in a table on BigQuery, and I want to choose one (per group) using the values of column A as a weight. But all I have access to (as far as I know) is a basic uniform random number generator RNG(). This isn’t the first time I’ve faced this situation, and whether it’s in Python or in Rust, I usually just hope there is some choose-style implementation for slices or arrays that handles weights which I can just use. But in SQL, this may not always (or ever?) be the case. Well, fear not: if you have access to random numbers, there is always a solution! In this short post, I will share how and why a simple algorithm allows you to use basic uniform random variables to efficiently sample elements using weights.
This isn’t an academic article, and none of what I’m going to write here is new. There are quite a few variants of the algorithm in question, under names like the Gumbel-Max Trick, the ES algorithm (where ES stands for Efraimidis and Spirakis), etc. In the references section you will find more details about these from people a lot more qualified than I am. Here, I just want to give you an intuition as to why this works and why it’s so simple, and therefore I will focus on the ES variant of the algorithm, which only requires one logarithm instead of two.
The main idea is to turn the sampling question into an ordering problem. Let’s assume you can make as many independent drawings of $u \sim \mathrm{Uniform}(0, 1)$ as you want (using e.g., rng.random() in your favorite library, or RNG() if your SQL engine allows it), and that you have positive weights $w_i$ for each of your candidate elements. The core idea of the ES algorithm is that if you draw one $u_i$ per candidate and build the score $s_i = \log(u_i) / w_i$, the probability that, for a given $i$, $s_i$ is the largest is exactly equal to $w_i / W$ where $W = \sum_j w_j$. So essentially, randomly sampling one candidate (using the weights) is equivalent to sorting all generated $s_i$ values and picking the largest. That’s it, easy!
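As a quick sanity check, here is a minimal Python sketch of this idea (the candidate names and weights are made up for the demo): sampling repeatedly by taking the argmax of $\log(u_i)/w_i$ should reproduce the weight proportions.

```python
import math
import random
from collections import Counter

random.seed(0)

# Hypothetical candidates and their weights (made up for this demo).
weights = {"a": 1.0, "b": 2.0, "c": 7.0}

def sample_one(weights):
    """Draw u ~ Uniform(0, 1] per candidate and return the argmax of log(u)/w."""
    return max(weights, key=lambda k: math.log(1.0 - random.random()) / weights[k])

n = 100_000
counts = Counter(sample_one(weights) for _ in range(n))
for key, w in weights.items():
    print(key, counts[key] / n, "expected", w / sum(weights.values()))
```

With weights 1, 2, and 7, the empirical frequencies should land close to 0.1, 0.2, and 0.7.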
Of course, I don’t expect you to just believe me, and since the proof for this statement is rather simple, let’s go through it together. The first thing we can do is work with the variable $k_i = \exp(s_i) = u_i^{1/w_i}$. Since the exponential is a monotonically increasing function, all our inequalities for $s_i$ will apply to $k_i$ as well, and we can get rid of the log. Note that this is useful for the proof, but in general it is slower to raise a number to the power $1/w_i$ than it is to apply a logarithm, which is why I prefer to use $s_i = \log(u_i)/w_i$ in numerical applications.
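A small sketch to illustrate the monotonicity argument: ranking by $k = u^{1/w}$ or by $s = \log(u)/w$ should crown the same winner (the weights here are arbitrary).

```python
import math
import random

random.seed(1)
weights = [1.5, 3.0, 0.5]  # arbitrary positive weights for the demo

# exp() is monotonic, so the argmax of k = u**(1/w) (handy in the proof)
# and the argmax of s = log(u)/w (cheaper numerically) should coincide.
trials = 1_000
agree = 0
for _ in range(trials):
    us = [1.0 - random.random() for _ in weights]  # u in (0, 1]
    ks = [u ** (1.0 / w) for u, w in zip(us, weights)]
    ss = [math.log(u) / w for u, w in zip(us, weights)]
    agree += ks.index(max(ks)) == ss.index(max(ss))
print(f"argmax agreed on {agree}/{trials} trials")
```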
So what we need to prove here is that $P(k_i > k_j \; \forall j \neq i) = w_i / W$ for every $i$, which in English means that the probability of any $i$ to have the largest score is proportional to its weight. To do this, we can start by rewriting this probability as¹

$$P(k_i > k_j \; \forall j \neq i) = \int_0^1 P(k_i = x) \prod_{j \neq i} P(k_j < x) \, dx.$$
This equation basically looks at every value $x$ that $k_i$ can take and sums (in a continuous manner) the probabilities that all the other variables are smaller than this value. I used that, since $u_i$ takes values in $[0, 1]$, so does $k_i = u_i^{1/w_i}$, and therefore the integral is bounded by this interval. To find $P(k_i = x)$, it’s a bit easier to first compute the cumulative distribution function (CDF) $F_{k_i}(x) = P(k_i \leq x)$ and then remember that the probability density function (PDF) is just its derivative $f_{k_i}(x) = F'_{k_i}(x)$. So let’s start with the CDF

$$F_{k_i}(x) = P(u_i^{1/w_i} \leq x) = P(u_i \leq x^{w_i}) = x^{w_i},$$
where in the first equality, I just raised each side to the power $w_i$, and in the second equality I used the definition of the CDF of $u_i$ (for a uniform variable, $P(u_i \leq y) = y$ on $[0, 1]$). From this, the PDF is straightforwardly $f_{k_i}(x) = w_i x^{w_i - 1}$. Putting this back into our main equation, we get

$$\int_0^1 w_i x^{w_i - 1} \prod_{j \neq i} x^{w_j} \, dx = w_i \int_0^1 x^{W - 1} \, dx = \frac{w_i}{W}.$$
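If you want to convince yourself numerically, here is a quick Monte Carlo sketch (the weight and evaluation point are arbitrary) comparing the empirical CDF of $k = u^{1/w}$ against the derived $x^{w}$.

```python
import random

random.seed(4)
w = 3.0   # an arbitrary weight for the demo
x = 0.6   # point at which we evaluate the CDF
n = 200_000
# Empirical P(k <= x) for k = u**(1/w), to compare with the derived x**w.
hits = sum(random.random() ** (1.0 / w) <= x for _ in range(n))
print(hits / n, "vs", x ** w)
```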
With this, we finally prove that $P(k_i > k_j \; \forall j \neq i) = w_i / W$. So each candidate has a probability proportional to its weight of having the largest score. As an example, you can apply this in some SQL dialects in the following way
```sql
SELECT *
FROM some_table
QUALIFY ROW_NUMBER() OVER (ORDER BY LOG(RNG()) / weight_column DESC) = 1
```

What if I need to select $k$ elements out of $n$ candidates?
The beauty of this method is that it generalizes well to sampling multiple candidates: you can just take the first $k$ elements (in decreasing score order) instead of only the maximum one.
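In Python, a sketch of this top-$k$ variant could look like the following (the items, weights, and helper name are made up for the example): score every candidate once, sort descending, keep the first $k$.

```python
import math
import random

random.seed(2)
items = ["a", "b", "c", "d", "e"]      # hypothetical candidates
weights = [5.0, 1.0, 1.0, 2.0, 1.0]

def weighted_sample_k(items, weights, k):
    """Weighted sampling without replacement: keep the k largest ES scores."""
    scored = [
        (math.log(1.0 - random.random()) / w, it)  # u in (0, 1]
        for it, w in zip(items, weights)
    ]
    scored.sort(reverse=True)
    return [it for _, it in scored[:k]]

picked = weighted_sample_k(items, weights, 3)
print(picked)
```

Because every item gets exactly one score, the $k$ winners are automatically distinct, which makes this a sampling-without-replacement scheme.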
Lastly, I will leave you with the similarity to the Gumbel-Max trick. The main difference is that the score you build and sort your data with is $s_i = \lambda_i + G_i$, where $G_i = -\log(-\log(u_i))$ if $u_i \sim \mathrm{Uniform}(0, 1)$. The rest of the logic is the same, except that each element is sampled with softmax-normalized probability $e^{\lambda_i} / \sum_j e^{\lambda_j}$. To make this work with our problem here, you need to relate the weights to the parameter $\lambda_i$ using $\lambda_i = \log(w_i)$, so that the probability of sampling a given element is proportional to the weight itself instead of an exponentiated weight.
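A minimal sketch of this Gumbel-Max version, using $\lambda_i = \log(w_i)$ as described (the variable names and weights are mine):

```python
import math
import random
from collections import Counter

random.seed(3)
weights = [1.0, 2.0, 7.0]
lambdas = [math.log(w) for w in weights]  # relate weights to lambda: lambda_i = log(w_i)

def gumbel_argmax(lambdas):
    """Add Gumbel noise G = -log(-log(u)) to each lambda and return the argmax."""
    scores = [lam - math.log(-math.log(random.random())) for lam in lambdas]
    return scores.index(max(scores))

n = 100_000
counts = Counter(gumbel_argmax(lambdas) for _ in range(n))
for i, w in enumerate(weights):
    print(i, counts[i] / n, "expected", w / sum(weights))
```

The empirical frequencies should match the weight proportions, just like with the ES scores, at the price of a second logarithm per candidate.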
References
- Gumbel-Max Trick and Weighted Reservoir Sampling
- The Gumbel-Max Trick for Discrete Distributions
- Weighted random sampling with a reservoir, Efraimidis and Spirakis (2005)
Footnotes
- ¹ Here I am slightly abusing the notation $P(k_i = x)$ to mean the PDF $f_{k_i}(x)$, which I calculate below. ↩