PostgreSQL and UUID as Primary Key
Maciej Walkowiak discusses the efficiency implications of using UUIDs as primary keys in PostgreSQL databases. Storing UUIDs as strings versus using the UUID data type impacts performance and scalability, suggesting considerations for database design in high-volume scenarios.
Maciej Walkowiak discusses the use of UUIDs as primary keys in PostgreSQL databases. While UUIDs offer benefits like easy generation and uniqueness across distributed systems, their larger size raises concerns about efficiency. Storing UUIDs as strings (text) in PostgreSQL is less efficient than using the dedicated UUID data type due to overhead and storage requirements. The article explores the impact of using UUIDs versus text for primary keys through experiments with large datasets, highlighting the performance differences in table and index sizes. Additionally, the discussion delves into the suitability of UUID v7 for B-tree indexes, potentially improving insert performance. Despite optimizations, the article suggests considering alternatives like TSID for primary keys if possible, emphasizing the importance of choosing the right data type from the start for scalability. The insights shared aim to guide database design decisions, especially for scenarios with significant data volumes or high traffic.
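The storage gap the article measures can be illustrated without a database at hand. A minimal sketch using Python's standard `uuid` module, comparing the raw 128-bit value a native uuid column holds against the canonical text form a text column would hold (the exact Postgres per-row overhead is not modeled here):

```python
import uuid

u = uuid.uuid4()

# A native uuid column stores the raw 128-bit value: 16 bytes per key.
print(len(u.bytes))   # 16

# The canonical text form is 36 characters (32 hex digits plus 4 hyphens);
# Postgres adds per-value varlena overhead on top of that, so the text
# representation costs well over twice as much per key, in every index too.
print(len(str(u)))    # 36
```

The same factor compounds in every B-tree index and foreign key that references the column, which is why the article's table- and index-size experiments diverge so sharply.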
Related
Schema changes and the Postgres lock queue
Schema changes in Postgres can cause downtime due to locking issues. Tools like pgroll help manage migrations by handling lock acquisition failures, preventing application unavailability. Setting lock_timeout on DDL statements is crucial for smooth schema changes.
PostgreSQL Statistics, Indexes, and Pareto Data Distributions
Close's Dialer system faced challenges due to data growth affecting performance. Adjusting PostgreSQL statistics targets and separating datasets improved performance. Tips include managing dead rows and optimizing indexes for efficient operation.
Our great database migration
Shepherd, an insurance pricing company, migrated from SQLite to Postgres to boost performance and scalability for their pricing engine, "Alchemist." The process involved code changes, adopting Neon database, and optimizing performance post-migration.
Eight versions of UUID and when to use them
The article covers eight versions of UUIDs, detailing their characteristics and best use cases. Recommendations include v4 for random IDs, v7 for sortable IDs, and v5 or v8 for custom data. Some versions have been replaced. The author shares insights and hints at a secretive project.
Just Use Postgres for Everything
The article promotes using Postgres extensively in tech stacks to simplify development, improve scalability, and reduce operational complexity. By replacing various technologies with Postgres, developers can enhance productivity, focus on customer value, and potentially cut costs.
Generally, the principles of separation of concerns and mechanical sympathy should be top of mind when designing a lasting and purposeful database schema.
Finally, since folks often say "I like Stripe's typed random IDs" in these kinds of threads: Stripe are lying when they say their IDs are random. They have some random parts, but when analyzed in sets, a large chunk of the binary layout is clearly metadata, including embedded timestamps, shard and reference keys, and versioning, in varying combinations depending on the service. I estimate they typically have 48-64 bits of randomness. That's still plenty for most systems; you can do the same. Personally I am very fond of base58-encoded AES-encrypted bigserial+HMAC locators with a leading type prefix and a trailing metadata digit, and in a pinch you can even do this inside the database with plv8.
So far my impression is that there are a whole lot of other things I need to worry about in Postgres before I spend time considering serial vs. random UUID vs. ordered UUID. Am I wrong here? Is this something that really matters and is worth investing more time in?
If your concern is globally unique identifiers (i.e. so that you can merge tables across multiple instances of your database), then UUID is exactly what you want. This is entirely what it is designed for.
If your concern is the privacy of sequence statistics, then UUID incidentally solves your problem. It may not be precisely what you want, and could continue to leak private information depending on the specific variant used. If you want privacy of sequence statistics, then I would suggest something like a sha256 hash of the primary key concatenated with a cryptographic salt stored in a separate column. These make excellent identifiers in places like APIs and URLs.
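The salted-hash approach above can be sketched in a few lines. This is a minimal illustration, not a vetted design; `public_id` and the 8-byte key encoding are choices made here for the example, and in practice the salt would live in its own column as the comment suggests:

```python
import hashlib
import secrets

def public_id(primary_key: int, salt: bytes) -> str:
    # Hash the key together with a secret per-row salt; without the salt,
    # the output reveals nothing about insertion order or row counts.
    return hashlib.sha256(salt + primary_key.to_bytes(8, "big")).hexdigest()

salt = secrets.token_bytes(16)  # stored in a separate column
print(public_id(42, salt))      # 64 hex chars, stable for this (key, salt) pair
```

The identifier is deterministic for a given row (so it can be indexed and looked up), yet adjacent primary keys produce unrelated outputs, which is the sequence-privacy property being discussed.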
If you desire a unique identifier that has a high quality UX, then this is in addition to the above columns. This sequence generally has lower entropy than the cryptographic approach (or the UX would suffer), so additional measures should be taken to protect the privacy of the identifiers (e.g. expire redemption tokens after a period of time and reissue via email).
Autoincrementing integers are really nice if you don't actually need UUIDs. Lots of tricks you can apply with them.
I suppose TSID works to this end, but it is certainly more complicated.
I personally prefer ULID since it is compat with a UUID type and you also get a timestamp lexicographically built into the ID so that sorting by ID also means sorting by timestamp. There are multiple PG extensions to make it easy to drop in and use.
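The time-sortable property that ULID (and UUID v7) provides comes from putting a millisecond timestamp in the high-order bits. A rough sketch of the idea, assuming Python's `uuid` module; note this omits the version and variant bits a spec-conformant implementation would set:

```python
import os
import time
import uuid

def time_sorted_uuid() -> uuid.UUID:
    # 48-bit millisecond timestamp in the high bits, 80 random bits below,
    # so byte-wise (and lexicographic) ordering tracks creation time.
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")
    return uuid.UUID(int=(ms << 80) | rand)

a = time_sorted_uuid()
time.sleep(0.002)
b = time_sorted_uuid()
assert a.bytes < b.bytes  # later ID sorts after earlier ID
```

Because the values arrive in roughly increasing order, inserts land on the rightmost B-tree leaf instead of random pages, which is the insert-performance benefit the article attributes to UUID v7.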
https://datatracker.ietf.org/doc/html/rfc9562
(which wasn't yet finished at the time of the article)
In SQLite, my assumption was that the consensus was towards UUID4 rather than 7 because it meant less likelihood of page-cache contention during transaction locks? Would that not also roughly map onto a Postgres-flavored system? Or does Postgres only have row-level locking?
I'm skeptical that, for the vast majority of uses, having time data in the IDs (while also not leaking it from actual timestamp fields) is an issue; for the cases where it is, just use random IDs.
Even theoretically, is there a way to translate timestamps into a German tank problem? If I give you a sampled set of timestamps (with no other node identifiers), can you estimate the cardinality of the greater dataset?
I was hoping to see it in version 17, but can't get a really good read on what's going on with this feature.
If your dataset is small, the overhead from UUIDs won't matter; if your dataset is large, the randomness of UUIDs will save your ass when you migrate to a distributed solution.
Obviously, this has ramifications for data write perf, but may well be worth it depending on the use case. Technically, you could also make the integer alternate (and leave the UUID as primary), but that may not be desirable for clustered / index-organized tables (for DBMSes that support them).
Per the RFC:
If UUIDs are required for use with any security operation within an application context in any shape or form then [RFC4122] UUIDv4 SHOULD be utilized.
- https://commitfest.postgresql.org/48/4388/ (original patch created live https://www.youtube.com/watch?v=YPq_hiOE-N8)
- https://postgres.fm/episodes/uuid
- https://postgres.fm/episodes/partitioning-by-ulid
- https://gitlab.com/postgres-ai/postgresql-consulting/postgre...
TSID: > A Java library for generating Time-Sorted Unique Identifiers (TSID).
Wouldn't this TSID thing be more useful if it were implemented as a set of PostgreSQL stored procedures or something than a Java library? Not everyone uses Java.
They often use tricks like including the MAC address of the generator machine and other ways to increase uniqueness assurances.
It was my understanding that uuids are simply very very unlikely to duplicate in situations with random generation.
It is inspired by Stripe IDs and looks like this: user_2x4y6z8a0b1c2d3e4f5g6h7j8k
I've recently built a Postgres extension for it that allows you to use it like a UUID: https://github.com/blitss/typeid-postgres
FWIW, I do expect UUIDv7 to take roughly half the time to generate if it has roughly half the random bits of UUIDv4 and an unprotected source of entropy is used.
If I don't want to leak the timestamp, I just use an auto generated integer along with it.
Okay but why