How unlikely are collisions with Short UUIDs?

Posted on Apr 7, 2024

I was working with a DataFrame today and I wanted to assign random IDs to each row. As it was a relatively small df (just 61K rows) I thought “better not waste resources with full UUIDs, let’s just use a short UUID”.

A UUID (v4) is a string with 32 randomly generated characters (0–9 and A-F) which looks like: c983837d-4b53–413c-bc1b-6c14cc73b745 . For readability reasons, it’s split using dashes, the first octet is usually referred to as “Short UUID”, in this case it’s c983837d.

Now, how many combinations of Short UUIDs are possible? Given that it has 8 characters, and there are 16 possible characters, there’s a total of 16⁸ possible combinations, which is equals to: 4,294,967,296.

So, the probability of a collision with a Short UUID is 1/4,294,967,296. Seems like a pretty low chance, right?

Well, the reality is a bit more paradoxical. The probability of a collision with ONE random UUID is pretty low (1/4B), but how likely is it to occur with 61K records? To calculate it, we need a small interlude:

The Birthday Paradox

This is an analogous problem to the Birthday Paradox, which is a pretty counterintuitive probability exercise. It goes something like:

Imagine you assemble a group of people. 10 people, 20 people, 50 people. It can be a group of friends, a football team, a college class, doesn’t matter. The Birthday Problem asks the question: What is the probability that 2 or more people share a birthday? So, for example, in a room with 20 people, what’s the probability that 2 of those people share a birthday? It seems low, right?

But there’s an interesting derivation of the Birthday problem, which is: how many people are needed to reach a probability of 50% of a “collision” in birthday dates? The answer is SURPRISING and counterintuitive: only 23 people are needed to have a 50% probability of a collision (2 people sharing the same birthday).

We actually have a Project on DataWars in which we analyze NBA teams and find players sharing a birthday. Check it out if you want to see the Birthday Problem live.

How likely is a collision with Short UUIDs?

We can use the Birthday paradox to calculate the probability of a Short UUID collision for 61K records.

The formula to calculate the probability of a collision given n elements each with probability 1/N is difficult to calculate, but the Wikipedia page provides a few approximations. The simplest one seems to be:

Image

In this case it’s n for number of people and d for number of days. In our case, we’d have n equals to 61K (the number of records) and d equals to 4B (the number of possible combinations of Short UUIDs). Resolving the equation gives us:

Image

a 35% chance of a collision! It’s not so low after all.

What does this mean?

I’m obviously switching back to full UUIDs. But the most important takeaway for me is how important it is to understand the mathematical foundations of the work we do, and always apply a “first principles approach”.

I started working on this today thinking it was more of a “Data Engineering” sort of task. I just have to process a few rows, invoke a few endpoints, and that’s it.

But when faced with this issue, I had to go back to the fundamentals to create a robust solution.

Data Engineering 🤝 Data Science


This post was automatically migrated from Medium. It still lives at: https://medium.com/@santiagobasulto/how-unlikely-are-collisions-with-short-uuids-e5d3a903e328