July 5th, 2024

PostgreSQL and UUID as Primary Key

Maciej Walkowiak discusses the efficiency implications of using UUIDs as primary keys in PostgreSQL databases. Storing UUIDs as strings versus using the UUID data type impacts performance and scalability, suggesting considerations for database design in high-volume scenarios.

Maciej Walkowiak discusses the use of UUIDs as primary keys in PostgreSQL databases. While UUIDs offer benefits like easy generation and uniqueness across distributed systems, their larger size raises concerns about efficiency. Storing UUIDs as strings (text) in PostgreSQL is less efficient compared to using the dedicated UUID data type due to overhead and storage requirements. The article explores the impact of using UUIDs versus text for primary keys through experiments with large datasets, highlighting the performance differences in table and index sizes. Additionally, the discussion delves into the suitability of UUID v7 for B-tree indexes, potentially improving insert performance. Despite optimizations, the article suggests considering alternatives like TSID for primary keys if possible, emphasizing the importance of choosing the right data type from the start for scalability. The insights shared aim to guide database design decisions, especially for scenarios with significant data volumes or high traffic.
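
The B-tree friendliness of UUID v7 comes from its layout: the top 48 bits are a Unix millisecond timestamp, so freshly generated values land near each other at the right edge of the index instead of scattering across it. A minimal, illustrative generator following the RFC 9562 layout (the `uuid7` helper here is hypothetical; Python's standard library does not ship one):

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Sketch of a UUIDv7 per RFC 9562: 48-bit Unix-epoch millisecond
    timestamp, then the version nibble, then random bits, with the
    RFC 4122 variant bits forced to 0b10."""
    unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big") & ((1 << 76) - 1)
    value = (unix_ms & 0xFFFFFFFFFFFF) << 80    # timestamp in bits 127..80
    value |= 0x7 << 76                          # version 7 in bits 79..76
    value |= rand                               # randomness in bits 75..0
    value = (value & ~(0b11 << 62)) | (0b10 << 62)  # variant bits 65..64
    return uuid.UUID(int=value)
```

Because the timestamp occupies the most significant bits, values generated later compare greater, which is exactly the property the article's insert benchmarks exercise.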

Related

Schema changes and the Postgres lock queue

Schema changes in Postgres can cause downtime due to locking issues. Tools like pgroll help manage migrations by handling lock acquisition failures, preventing application unavailability. Setting lock_timeout on DDL statements is crucial for smooth schema changes.

PostgreSQL Statistics, Indexes, and Pareto Data Distributions

Close's Dialer system faced challenges due to data growth affecting performance. Adjusting PostgreSQL statistics targets and separating datasets improved performance. Tips include managing dead rows and optimizing indexes for efficient operation.

Our great database migration

Shepherd, an insurance pricing company, migrated from SQLite to Postgres to boost performance and scalability for their pricing engine, "Alchemist." The process involved code changes, adopting Neon database, and optimizing performance post-migration.

Eight versions of UUID and when to use them

The article covers eight versions of UUIDs, detailing their characteristics and best use cases. Recommendations include v4 for random IDs, v7 for sortable IDs, and v5 or v8 for custom data. Some versions have been replaced. The author shares insights and hints at a secretive project.

Just Use Postgres for Everything

The article promotes using Postgres extensively in tech stacks to simplify development, improve scalability, and reduce operational complexity. By replacing various technologies with Postgres, developers can enhance productivity, focus on customer value, and potentially cut costs.

32 comments
By @inopinatus - 3 months
The best advice I can give you is to use bigserial for B-tree friendly primary keys and consider a string-encoded UUID as one of your external record locator options. Consider other simple options like PNR-style (airline booking) locators first, especially if nontechnical users will quote them. It may even be OK if they’re reused every few years. Do not mix PK types within the schema for a service or application, especially a line-of-business application. Use UUIDv7 only as an identifier for data that is inherently timecoded, otherwise it leaks information (even if timeshifted). Do not use hashids - they have no cryptographic qualities and are less friendly to everyday humans than the integers they represent; you may as well just use the sequence ID. As for the encoding, do not use base64 or other hyphenated alphabets, nor any identifier scheme that can produce a leading ‘0’ (zero) or ‘+’ (plus) when encoded (for the day your stuff is pasted via Excel).

Generally, the principles of separation of concerns and mechanical sympathy should be top of mind when designing a lasting and purposeful database schema.

Finally, since folks often say “I like Stripe’s typed random IDs” in these kinds of threads: Stripe are lying when they say their IDs are random. They have some random parts but when analyzed in sets, a large chunk of the binary layout is clearly metadata, including embedded timestamps, shard and reference keys, and versioning, in varying combinations depending on the service. I estimate they typically have 48-64 bits of randomness. That’s still plenty for most systems; you can do the same. Personally I am very fond of base58-encoded AES-encrypted bigserial+HMAC locators with a leading type prefix and a trailing metadata digit, and you can in a pinch even do this inside the database with plv8.

By @fabian2k - 3 months
My somewhat naive understanding was that random UUIDs were not that big of a deal in Postgres because it does not cluster by primary key. And of course a UUID (16 bytes) is larger than a serial (4 bytes) or bigserial (8 bytes) by a factor of 2-4. This certainly might matter for an index, but on a whole-table level, where you have 20+ bytes of overhead per row, this doesn't seem that big of a deal for anything except very narrow tables with large row counts.
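
As a back-of-envelope check on that size argument (an illustrative sketch, not a benchmark): a uuid column costs 16 bytes against bigint's 8, and the difference is paid once per row plus once per referencing foreign-key column, with index entries adding a similar per-row cost.

```python
def extra_bytes(rows: int, fk_columns: int = 0) -> int:
    """Rough extra heap storage from choosing uuid (16 bytes) over
    bigint (8 bytes): 8 bytes per row, plus 8 bytes per row for each
    referencing FK column. Indexes add a comparable per-entry cost."""
    return rows * 8 * (1 + fk_columns)
```

At 100 million rows that is about 0.8 GB before indexes: real, but modest next to the 20+ bytes of fixed per-row overhead mentioned above.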

So far my impression is that there are a whole lot of other things I need to worry about in Postgres before I spend time considering serial vs. random UUID vs. ordered UUID. Am I wrong here and this is something that really matters and you should invest more time in?

By @bob1029 - 3 months
I see some comments conflating privacy of sequence statistics with global uniqueness considerations and UX.

If your concern is globally unique identifiers (i.e. so that you can merge tables across multiple instances of your database), then UUID is exactly what you want. This is entirely what it is designed for.

If your concern is the privacy of sequence statistics, then UUID incidentally solves your problem. It may not be precisely what you want, and could continue to leak private information depending on the specific variant used. If you want privacy of sequence statistics, then I would suggest something like a sha256 hash of the primary key concatenated with a cryptographic salt stored in a separate column. These make excellent identifiers in places like APIs and URLs.
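
The salted-hash idea above could look like this (the `public_id` helper is hypothetical; the salt has to be persisted, e.g. in a separate column, and the resulting digest indexed so lookups remain possible):

```python
import hashlib
import secrets

def public_id(pk: int, salt: bytes) -> str:
    """Derive a stable, non-reversible external identifier from an
    integer primary key plus a secret salt. Store the digest in its
    own indexed column and use it in APIs and URLs."""
    return hashlib.sha256(salt + pk.to_bytes(8, "big")).hexdigest()

salt = secrets.token_bytes(16)  # generated once and persisted
```

Same input, same salt gives the same identifier, so it works as a lookup key, but without the salt the sequence statistics of the underlying primary key stay hidden.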

If you desire a unique identifier that has a high quality UX, then this is in addition to the above columns. This sequence generally has lower entropy than the cryptographic approach (or the UX would suffer), so additional measures should be taken to protect the privacy of the identifiers (e.g. expire redemption tokens after a period of time and reissue via email).

Autoincrementing integers are really nice if you don't actually need UUIDs. Lots of tricks you can apply with them.

By @declan_roberts - 3 months
Call me old-fashioned, but I really like integer autoincrement primary keys. They're easy to understand and obviously simple to sort. Furthermore, when working on large batch projects you can just store the last primary key as your high-water mark and get everything greater than that.
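
The high-water-mark pattern amounts to `WHERE id > :last ORDER BY id LIMIT :n`; a small in-memory sketch (the `fetch_batch` helper is made up, operating on (id, payload) tuples):

```python
def fetch_batch(rows, high_water_mark, limit=1000):
    """Return rows with id greater than the last processed id, in id
    order, mimicking SELECT ... WHERE id > %s ORDER BY id LIMIT %s.
    Also returns the new high-water mark to persist for the next run."""
    picked = sorted(r for r in rows if r[0] > high_water_mark)[:limit]
    new_mark = picked[-1][0] if picked else high_water_mark
    return picked, new_mark
```

This only works because the keys are monotonic: every new row is guaranteed to sort after the stored mark, which random UUIDv4 keys cannot promise.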

I suppose TSID works to this end, but certainly more complicated.

By @londons_explore - 3 months
It would be nice for these comparisons to also include 'int64' so people can see how much of an overhead UUIDs are compared to the traditional approach.
By @sarreph - 3 months
We chose ULID for our Postgres PK recently, and this article helped a lot in making that decision: https://brandur.org/nanoglyphs/026-ids

I personally prefer ULID since it is compat with a UUID type and you also get a timestamp lexicographically built into the ID so that sorting by ID also means sorting by timestamp. There are multiple PG extensions to make it easy to drop in and use.
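
For reference, a ULID is a 48-bit millisecond timestamp followed by 80 random bits, Crockford-base32 encoded into 26 characters, which is why lexicographic order tracks creation time. A toy encoder (illustrative only; use one of the maintained extensions or libraries in practice):

```python
import os
import time

# Crockford base32: ASCII-ascending, so string order == numeric order
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid() -> str:
    """Sketch of a ULID: 48-bit millisecond timestamp + 80 random bits,
    packed into 128 bits and encoded as 26 base32 characters."""
    value = (int(time.time() * 1000) << 80) | int.from_bytes(os.urandom(10), "big")
    # 26 chars x 5 bits = 130 bits; the top group carries only 3 bits
    return "".join(CROCKFORD[(value >> shift) & 0x1F]
                   for shift in range(125, -5, -5))
```

Because the timestamp sits in the most significant bits and the alphabet is ASCII-ascending, plain string comparison sorts by creation time.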

By @vog - 3 months
Note that the article's link to the UUID v7 standard is now outdated. You should instead head directly to RFC 9562:

https://datatracker.ietf.org/doc/html/rfc9562

(which wasn't yet finished at the time of the article)

By @GiorgioG - 3 months
UUIDs are miserable to work with. For 99% of use cases, sequential IDs are fine. Can they be guessed? Sure, but your software should guard against unauthorized access. Security through obscurity or randomness is a poor excuse for using UUIDs as PKs. If you don't want to expose your IDs, use a slug.
By @aabhay - 3 months
I think insert performance is a bad way to evaluate performance here, no? While B-Tree performance for time sorted keys is better on insert, what about during large transactions?

In SQLite, my assumption was that the consensus was towards UUID4 rather than 7 because it meant less likelihood of page cache contention during transaction locks? Would that not also roughly map onto a Postgres-flavored system? Or does Postgres only have row-level locking?

By @kortex - 3 months
Has there ever been an attack/leak based on time metadata in UUIDs/ULIDs/Snowflakes/etc (not sequential), in a domain outside of sensitive ones (HIPAA/healthcare, defense)?

I'm skeptical that, for the vast majority of uses, having time data in the IDs (assuming it isn't also leaked from actual timestamp fields) is an issue; for the cases where it is, just use random IDs.

Even theoretically, is there a way to translate timestamps into a German tank problem? If I give you a sampled set of timestamps (with no other node identifiers), can you estimate the cardinality of the greater dataset?

By @sbuttgereit - 3 months
The article had a link to the PostgreSQL commitfest for UUIDv7 support, but as far as I can tell it looks unlikely that it will actually be in PostgreSQL 17. The most recent action was the committer being removed from the task, and I believe version 17 is already well past feature freeze... Is my understanding correct? This is what I think is going on, but I can't find any substantiated facts to point me to a definite conclusion.

I was hoping to see it in version 17, but can't get a really good read on what's going on with this feature.

By @netcraft - 3 months
Another day, another article saying not to use UUIDs as PKs. I've maintained systems using UUIDs stored as char(36) with million-record tables without issue. This is not an endorsement, just explaining that this is bikeshedding. Should you use v7 when you can? Sure. Would int/bigint be faster in your benchmarks? Sure. But the benefits totally outweigh the speed differences until you get to a very large system. But instead of worrying about this, spend your energy on a million other things first and then celebrate when UUIDs become your bottleneck.
By @brigadier132 - 3 months
My strategy is to use v4 Uuids for anything that is not inserted frequently and don't need to be ordered (think user ids) and v7 ids for things that are.

If your dataset is small the overhead from Uuids wont matter, if your dataset is large the randomness of Uuids will save your ass when you migrate to a distributed solution.

By @branko_d - 3 months
I'm surprised the author didn't mention foreign keys: since the primary key is often referenced by foreign keys, if you have a "fat" PK all your FKs will also be "fat". This can be solved by using the UUID as an alternate key and the regular integer as a primary key in the "top" table, and then referencing that integer from all the "child" tables.

Obviously, this has ramifications for data write perf, but may well be worth it depending on the use case. Technically, you could also make the integer the alternate key (and leave the UUID as primary), but that may not be desirable for clustered / index-organized tables (for DBMSes that support them).
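
The surrogate-plus-alternate-key layout described above, sketched with SQLite standing in for Postgres (table and column names are made up for illustration):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (
        id    INTEGER PRIMARY KEY,       -- narrow surrogate key used by FKs
        uuid  TEXT NOT NULL UNIQUE,      -- fat alternate key for the outside
        name  TEXT NOT NULL
    );
    CREATE TABLE invoice (
        id         INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL REFERENCES account(id)
    );
""")

# External callers only ever see the UUID; joins stay on the 8-byte integer.
acct_uuid = str(uuid.uuid4())
conn.execute("INSERT INTO account (uuid, name) VALUES (?, ?)",
             (acct_uuid, "acme"))
acct_id = conn.execute("SELECT id FROM account WHERE uuid = ?",
                       (acct_uuid,)).fetchone()[0]
conn.execute("INSERT INTO invoice (account_id) VALUES (?)", (acct_id,))
```

The write cost is one extra indexed column on the top table; every child table and every join then works with the narrow integer.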

By @rootedbox - 3 months
Just a heads up about UUID 7.. be careful when using.

Per the RFC:

If UUIDs are required for use with any security operation within an application context in any shape or form then [RFC4122] UUIDv4 SHOULD be utilized.

By @hu3 - 3 months
Would have been nice to also include bigserial/bigint in the INSERT performance comparison to understand the impact of migrating those to UUIDv4 and UUIDv7.
By @quotemstr - 3 months
> If you have an option to choose, take a look at TSID maintained by Vlad Mihalcea.

TSID: > A Java library for generating Time-Sorted Unique Identifiers (TSID).

Wouldn't this TSID thing be more useful if it were implemented as a set of PostgreSQL stored procedures or something than a Java library? Not everyone uses Java.

By @AtlasBarfed - 3 months
UUIDs are guaranteed to be unique?

They often use tricks like including the MAC address of the generator machine and other ways to increase uniqueness assurances.

It was my understanding that UUIDs are simply very, very unlikely to duplicate in situations with random generation.
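
Right: for random UUIDs, uniqueness is probabilistic, not guaranteed. With the 122 random bits of a v4, the usual birthday-bound approximation p ≈ 1 − exp(−n² / 2^123) stays negligible for any realistic table size:

```python
import math

def collision_probability(n: int, bits: int = 122) -> float:
    """Birthday-bound approximation of the chance that at least two of
    n values drawn uniformly from a 2**bits space collide
    (UUIDv4 carries 122 random bits)."""
    return 1.0 - math.exp(-(n * n) / 2 ** (bits + 1))
```

Even at a billion rows the collision probability is far below anything worth engineering around; it only approaches even odds near n ≈ 2^61 values.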

By @Pinus - 3 months
I’m not sure I understand the bit about time-ordered values being a better fit for B-trees. Actually, I’m sure that I don’t understand it... If anything, I’d naively assume that inserting already-sorted values would be a worst case, always going down the same branch of the tree and triggering re-balancing operations left and right (figuratively — literally mostly left!). The Wikipedia article on B-trees notices that there are some shortcuts that can be taken if one knows one is going to do a batch insert of sorted values. Is that what happens? But then how does Postgres know this?
By @rjurney - 3 months
I've started using python-ulid as ULIDs seem superior to UUIDs. https://pypi.org/project/python-ulid/
By @nubinetwork - 3 months
If you're generating random UUIDs as the primary key, how do you not run into key collisions? Having to search the entire table before inserting is slow, and catching the error and trying again is also annoying.
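
In practice you don't search before inserting: the primary-key unique index performs the collision check as part of the insert, and you retry with a fresh UUID on the (astronomically rare) unique violation. A sketch using SQLite in place of Postgres (the `insert_with_retry` helper is made up):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, body TEXT)")

def insert_with_retry(body: str, attempts: int = 3) -> str:
    """Insert with a freshly generated UUID, letting the unique index
    detect collisions; on a duplicate-key error, generate a new UUID
    and try again. No pre-insert SELECT is needed."""
    for _ in range(attempts):
        new_id = str(uuid.uuid4())
        try:
            conn.execute("INSERT INTO items VALUES (?, ?)", (new_id, body))
            return new_id
        except sqlite3.IntegrityError:
            continue  # duplicate key; retry with a fresh UUID
    raise RuntimeError("could not generate a unique id")
```

The retry branch is effectively dead code at UUIDv4 collision rates, but it documents the failure mode instead of silently assuming it away.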
By @leighleighleigh - 3 months
I couldn't really tell you why, but I wrangle many gigabytes of UUID-keyed data for work, and adding a BRIN index yesterday sped things up by 400%. Just my two cents!
By @blits - 3 months
You might also be interested in TypeID, which is an extension of UUIDv7: also sortable and, unlike UUIDv7, more human-readable.

It is inspired by Stripe IDs and looks like this: user_2x4y6z8a0b1c2d3e4f5g6h7j8k

I've recently built a Postgres extension for it that allows using it like a UUID: https://github.com/blitss/typeid-postgres

By @scotty79 - 3 months
Isn't B-Tree with UUIDv4 keys getting more balanced than with UUIDv7? Doesn't longer insert time result in faster searches later?
By @necovek - 3 months
The benchmark for UUIDv7 insertion time includes UUID generation: when looking at simply index update cost, I'd like to see those decoupled.

FWIW, I do expect UUIDv7 to take roughly half the time to generate if it has roughly half the random bits of UUIDv4 and an unprotected source of entropy is used.

By @hakanderyal - 3 months
Using ULID has probably the best ROI among all the architectural decisions I've made. Generating IDs on the server before insert is helpful, and not just for distributed system issues.

If I don't want to leak the timestamp, I just use an auto generated integer along with it.

By @danolivo - 3 months
The big problem with serials is their monotonic growth. Each time you add a new record you go beyond the statistics boundaries and challenge PostgreSQL to guess.
By @osigurdson - 3 months
I'm surprised there is no mention of hash based indexes (i.e. CREATE INDEX ... USING hash...) since lookups would conceivably always use equality.
By @tantalor - 3 months
> inserting UUID v7 is ~2x faster than inserting regular UUID v4

Okay but why

By @avkrpatel - 3 months
Instead of UUIDv7, why can't they use ULID?
By @rizky05 - 3 months
I've been using gen_random_uuid() (Postgres's function) as the default value for the primary key. Is there any downside?