Choose a Random SQL Table Row Efficiently

How do you select a random row from a SQL table?

I can find three popular solutions on the web. The first is to sort the table randomly and take the first value. The second is to select a random number between 1 and the number of rows in the table, traverse that many rows down the table, and select that row. The third is to take a unique ID column from the table, select all of its values, select a random ID from there, and then return to the database for the remainder of the row.

Here they are in SQL-like pseudocode.

SELECT * FROM tbl ORDER BY rand() LIMIT 1 -- Random sort method
SELECT * FROM tbl LIMIT 1 OFFSET rand(1 through tbl rows) -- Random offset method
SELECT id FROM tbl ; SELECT * FROM tbl WHERE id = some_id -- Random ID method

The problem with both of these methods is that they are inefficient for large tables. The random sort method requires sorting the table, which takes at least O(n log n) time, where n is the number of rows in the table. The random offset method needs to walk through between one and n rows of the table, which takes O(n/2) time. The random ID method needs to gather all the ID values in the table, which requires O(n) time. In other words, the larger the table, the longer each method takes.

Ideally, it would be possible to select a random row from a table in linear, or perhaps logarithmic time. I have come up with a way that nearly does this. It requires an extra table column and does not provide a perfectly random distribution, but these problems may be minor compared to the savings in algorithmic time.

The Random Row Selection Method

First, add a column to the table. I will call it distribution. Index the column. Assign each row with a distribution value between 0 and 1.

Second, to select a row, choose a random number between 0 and 1. Select the first row whose value is less than the random number.

SELECT * FROM tbl WHERE distribution < rand(0 through 1)
    ORDER BY distribution DESC
    LIMIT 1

Sometimes this will return no rows (when the random number was smaller than any value in the distribution column). In that case, “wrap around” to select the value with the highest distribution value.

SELECT * FROM tbl
    ORDER BY distribution DESC
    LIMIT 1

Because the column is indexed, these operations should take logarithmic time, O(log n).

We could just stop here. However, this would result in some of the rows being selected more often than others. As a result, it is best to “churn” the distribution column a bit by changing up the values. The most obvious value to change is that of the row we just selected.

UPDATE tbl SET distribution = rand(0 through 1) WHERE id = [row we just selected]

Experimental Results

I haven’t implemented this in SQL yet, but I have run some simulations and looked at some statistics.

In general, this method produces a fairly even distribution of rows. Indeed, it sometimes appears that the distribution is more even than a truly random row selection method. With true randomness, you would expect that some rows are selected more often and others less often for any given period, but my algorithm seems to select each row with about the same frequency.

However, there is one very clear point of non-randomness. If a row is selected once, then the probability that it will be selected immediately afterwards is about two to three times higher than it should be. In other words, the algorithm seems to prefer picking the same rows consecutively.

If you think about this, it makes sense. Each time that you pick a row and move its distribution value, you create a “hole” in the space of distribution values.

. . . . . . . .  Moving a distribution value
  |      ^
   \----/

.   . . ... . .  creates a large hole
|---| (the hole)

Since this hole is larger than normal, the algorithm is (1) more likely to relocate a row’s distribution value into the hole, and (2) more likely to pick a random value in that hole immediately thereafter, thus picking the same row twice in a row.

I imagine, though, that there are tweaks to this algorithm that could alleviate this problem (for example, relocating more than one row’s distribution value at a time). Additionally, some people might prefer an algorithm that never picks the same row twice in a row. There are probably simple tweaks to the algorithm that would allow for this.

Closing notes

If you are skilled in probability or statistics, I would be interested in a more complete analysis of my algorithm.

Also, if you have a table with sequential IDs, don’t use this algorithm. Just find the maximum ID in the table, pick a random number between 1 and that maximum ID, and find the corresponding row. If there is no matching row, try again until you find one. So long as the maximum ID is not much greater than the number of rows in the table, this will be highly efficient.