Random weighted sampling in a hurry
I recently encountered the following problem: I have some rows in a table on BigQuery, and I want to choose one (per group) using the values of column A as a weight. But all I have access to (as far as I know) is a basic uniform random number generator RNG(). This isn’t the first time I’ve faced this situation, and whether it’s in Python or in Rust, I usually just hope there is some choose-style implementation for slices or arrays that handles weights which I can just use. But in SQL, this may not always (or ever?) be the case. Well, fear not: if you have access to random numbers, there is always a solution! In this short post, I will share how and why a simple algorithm allows you to use basic uniform random variables to efficiently sample elements using weights.
This isn’t an academic article, and none of what I’m going to write here is new. There are quite a few variants of the algorithm in question, under names like the Gumbel-Max Trick, the ES algorithm (where ES stands for Efraimidis and Spirakis), etc. In the references section you will find more details about these from people a lot more qualified than I am. Here, I just want to give you an intuition as to why this works and why it’s so simple, and therefore I will focus on the ES variant of the algorithm, which only requires one logarithm instead of two.
The main idea is to turn the sampling question into an ordering problem. Let’s assume you can make as many independent drawings of $u \sim \mathrm{Uniform}(0, 1)$ as you want (using e.g., rng.random() in your favorite library, or RNG() if your SQL engine allows it), and that you have positive weights $w_i$ for each of your candidate elements. The core idea of the ES algorithm is that if you draw one $u_i$ per candidate and build the score $s_i = \log(u_i) / w_i$, the probability that, for a given $i$, $s_i$ is the largest is exactly equal to $w_i / W$ where $W = \sum_j w_j$. So essentially, randomly sampling one candidate (using the weights) is equivalent to sorting all generated $s_i$ values and picking the largest. That’s it, easy!
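As a quick sanity check, here is a minimal Python sketch of this idea (the candidate names and weights are made up for the demo): sampling repeatedly by taking the argmax of $\log(u_i)/w_i$ should reproduce the weight proportions.

```python
import math
import random
from collections import Counter

random.seed(0)

# Hypothetical candidates and their weights (made up for this demo).
weights = {"a": 1.0, "b": 2.0, "c": 7.0}

def sample_one(weights):
    """Draw u ~ Uniform(0, 1] per candidate and return the argmax of log(u)/w."""
    return max(weights, key=lambda k: math.log(1.0 - random.random()) / weights[k])

n = 100_000
counts = Counter(sample_one(weights) for _ in range(n))
for key, w in weights.items():
    print(key, counts[key] / n, "expected", w / sum(weights.values()))
```

With weights 1, 2, and 7, the empirical frequencies should land close to 0.1, 0.2, and 0.7.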
Of course, I don’t expect you to just believe me, and since the proof for this statement is rather simple, let’s go through it together. The first thing we can do is work with the variable $k_i = \exp(s_i) = u_i^{1/w_i}$. Since the exponential is a monotonically increasing function, all our inequalities for $s_i$ will apply to $k_i$ as well, and we can get rid of the log. Note that this is useful for the proof, but in general it is slower to raise a number to the power $1/w_i$ than it is to apply a logarithm, which is why I prefer to use $s_i = \log(u_i)/w_i$ in numerical applications.
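A small sketch to illustrate the monotonicity argument: ranking by $k = u^{1/w}$ or by $s = \log(u)/w$ should crown the same winner (the weights here are arbitrary).

```python
import math
import random

random.seed(1)
weights = [1.5, 3.0, 0.5]  # arbitrary positive weights for the demo

# exp() is monotonic, so the argmax of k = u**(1/w) (handy in the proof)
# and the argmax of s = log(u)/w (cheaper numerically) should coincide.
trials = 1_000
agree = 0
for _ in range(trials):
    us = [1.0 - random.random() for _ in weights]  # u in (0, 1]
    ks = [u ** (1.0 / w) for u, w in zip(us, weights)]
    ss = [math.log(u) / w for u, w in zip(us, weights)]
    agree += ks.index(max(ks)) == ss.index(max(ss))
print(f"argmax agreed on {agree}/{trials} trials")
```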
So what we need to prove here is that $P(k_i > k_j \; \forall j \neq i) = w_i / W$ for every $i$, which in English means that the probability of any $i$ to have the largest score is proportional to its weight. To do this, we can start by rewriting this probability as¹

$$P(k_i > k_j \; \forall j \neq i) = \int_0^1 P(k_i = x) \prod_{j \neq i} P(k_j < x) \, dx.$$
This equation basically looks at every value $x$ that $k_i$ can take and sums (in a continuous manner) the probabilities that all the other variables are smaller than this value. I used that, since $u_i$ takes values in $[0, 1]$, so does $k_i = u_i^{1/w_i}$, and therefore the integral is bounded by this interval. To find $P(k_i = x)$, it’s a bit easier to first compute the cumulative distribution function (CDF) $F_{k_i}(x) = P(k_i \leq x)$ and then remember that the probability density function (PDF) is just its derivative $f_{k_i}(x) = F'_{k_i}(x)$. So let’s start with the CDF

$$F_{k_i}(x) = P(u_i^{1/w_i} \leq x) = P(u_i \leq x^{w_i}) = x^{w_i},$$
where in the first equality, I just raised each side to the power $w_i$, and in the second equality I used the definition of the CDF of $u_i$ (for a uniform variable, $P(u_i \leq y) = y$ on $[0, 1]$). From this, the PDF is straightforwardly $f_{k_i}(x) = w_i x^{w_i - 1}$. Putting this back into our main equation, we get

$$\int_0^1 w_i x^{w_i - 1} \prod_{j \neq i} x^{w_j} \, dx = w_i \int_0^1 x^{W - 1} \, dx = \frac{w_i}{W}.$$
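If you want to convince yourself numerically, here is a quick Monte Carlo sketch (the weight and evaluation point are arbitrary) comparing the empirical CDF of $k = u^{1/w}$ against the derived $x^{w}$.

```python
import random

random.seed(4)
w = 3.0   # an arbitrary weight for the demo
x = 0.6   # point at which we evaluate the CDF
n = 200_000
# Empirical P(k <= x) for k = u**(1/w), to compare with the derived x**w.
hits = sum(random.random() ** (1.0 / w) <= x for _ in range(n))
print(hits / n, "vs", x ** w)
```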
With this, we finally prove that $P(k_i > k_j \; \forall j \neq i) = w_i / W$. So each candidate has a probability proportional to its weight of having the largest score. As an example, you can apply this in some SQL dialects in the following way
```sql
SELECT *
FROM some_table
QUALIFY ROW_NUMBER() OVER (ORDER BY LOG(RNG()) / weight_column DESC) = 1
```

What if I need to select $k$ elements out of $n$ candidates?
The beauty of this method is that it generalizes well to sampling multiple candidates: you can just take the first $k$ elements (in decreasing score order) instead of only the maximum one.
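In Python, a sketch of this top-$k$ variant could look like the following (the items, weights, and helper name are made up for the example): score every candidate once, sort descending, keep the first $k$.

```python
import math
import random

random.seed(2)
items = ["a", "b", "c", "d", "e"]      # hypothetical candidates
weights = [5.0, 1.0, 1.0, 2.0, 1.0]

def weighted_sample_k(items, weights, k):
    """Weighted sampling without replacement: keep the k largest ES scores."""
    scored = [
        (math.log(1.0 - random.random()) / w, it)  # u in (0, 1]
        for it, w in zip(items, weights)
    ]
    scored.sort(reverse=True)
    return [it for _, it in scored[:k]]

picked = weighted_sample_k(items, weights, 3)
print(picked)
```

Because every item gets exactly one score, the $k$ winners are automatically distinct, which makes this a sampling-without-replacement scheme.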
Lastly, I will leave you with the similarity to the Gumbel-Max trick. The main difference is that the score you build and sort your data with is $s_i = \lambda_i + G_i$, where $G_i = -\log(-\log(u_i))$ if $u_i \sim \mathrm{Uniform}(0, 1)$. The rest of the logic is the same, except that each element is sampled with softmax-normalized probability $e^{\lambda_i} / \sum_j e^{\lambda_j}$. To make this work with our problem here, you need to relate the weights to the parameter $\lambda_i$ using $\lambda_i = \log(w_i)$, so that the probability of sampling a given element is proportional to the weight itself instead of an exponentiated weight.
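A minimal sketch of this Gumbel-Max version, using $\lambda_i = \log(w_i)$ as described (the variable names and weights are mine):

```python
import math
import random
from collections import Counter

random.seed(3)
weights = [1.0, 2.0, 7.0]
lambdas = [math.log(w) for w in weights]  # relate weights to lambda: lambda_i = log(w_i)

def gumbel_argmax(lambdas):
    """Add Gumbel noise G = -log(-log(u)) to each lambda and return the argmax."""
    scores = [lam - math.log(-math.log(random.random())) for lam in lambdas]
    return scores.index(max(scores))

n = 100_000
counts = Counter(gumbel_argmax(lambdas) for _ in range(n))
for i, w in enumerate(weights):
    print(i, counts[i] / n, "expected", w / sum(weights))
```

The empirical frequencies should match the weight proportions, just like with the ES scores, at the price of a second logarithm per candidate.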
References
- Gumbel-Max Trick and Weighted Reservoir Sampling
- The Gumbel-Max Trick for Discrete Distributions
- Weighted random sampling with a reservoir, Efraimidis and Spirakis (2005)
Footnotes
- ¹ Here I am slightly abusing the notation $P(k_i = x)$ to mean the PDF $f_{k_i}(x)$, which I calculate below. ↩