August 5th, 2024

Postgres stores data on disk – this one's a page turner

PostgreSQL stores data in a structured directory at /var/lib/postgresql/data, containing essential subdirectories and files for database operations, access control, statistics, and transaction management, aiding developers in data optimization.

Read original articleLink Icon
CuriosityInterestConfusion
Postgres stores data on disk – this one's a page turner

PostgreSQL, a widely used relational database, stores its data in a structured directory located at /var/lib/postgresql/data. This directory contains various subdirectories and files essential for database operations. When a fresh PostgreSQL installation is initiated, several folders are created, including 'base' for individual databases, 'global' for cluster-wide tables, and 'pg_wal' for Write-Ahead Log files, which are crucial for data recovery. The database operates through a series of concepts such as database clusters, connections, sessions, transactions, and snapshots, which help manage data integrity and concurrency.

The 'pg_hba.conf' file controls access permissions, while 'pg_stat' and 'pg_stat_tmp' store statistics for query optimization. Other directories like 'pg_commit_ts' and 'pg_serial' manage transaction data, and 'pg_replslot' supports replication for data redundancy. Understanding these components is vital for developers to effectively utilize PostgreSQL, as it allows them to grasp how data is organized and manipulated within the system. The article emphasizes that while databases may seem complex, they are built on fundamental principles that can be understood with some exploration. This knowledge is particularly beneficial for developers looking to optimize their applications and ensure efficient data management.

AI: What people are saying
The comments reflect a diverse range of insights and experiences related to PostgreSQL and database management.
  • Discussion on the technical aspects of PostgreSQL, including data storage formats and logical decoding.
  • Comparisons between PostgreSQL and other database systems, particularly InnoDB and NoSQL solutions.
  • Curiosity about PostgreSQL's design choices, such as page size and data ordering.
  • Technical issues encountered by users, particularly with Docker commands related to PostgreSQL.
  • General appreciation for the article and anticipation for follow-up content.
Link Icon 15 comments
By @dfox - 4 months
> Then, there’s a random byte like 0x25 or 0x07 followed by the column data – the rest of the columns are string types so they’re all stored in UTF-8. If you know what these inter-column bytes mean, leave a comment below! I can’t figure it out.

Next paragraph mentions TOAST and this byte is related to that. The low order bits (on little endian platforms) determine whether the value is stored inline (00, first 4 bytes are total length), is stored in TOAST table (11) or is shorter than 127 bytes (01 for even length, 10 for odd length, the total length is first byte >> 1). So for 0x25 you get 01, so length is 0x25 >> 1 = 18, which is that byte followed by "Equatorial Guinea".

Edit: the reason why endianness matters is that the same representation is also used in memory and the whole first word is interpreted as one length value. The toast tag bits have to be in first byte, which is most easily done as two highest order bits of that word on big endian. That means that it is placed in the two highest bits of the byte.

By @indoordin0saur - 4 months
This URL is blocked by my company's network because of a certain substring in the URL lol
By @sgarland - 4 months
If anyone is interested in contrasting this with InnoDB (MySQL’s default engine), Jeremy Cole has an outstanding blog series [0] going into incredible detail.

[0]: https://blog.jcole.us/innodb/

By @dfox - 4 months
> This process of retrieving the expected database state from the WAL is called logical decoding and Postgres stores files related to this process in here.

While logical decoding is about WAL, it is not related to the recovery process. Logical decoding is a mechanism to convert the WAL entries back into the high-level operations that caused the WAL entries, for example for replication or audit.

By @jmholla - 4 months
Handy tip: instead of Python you can use `xxd` to inspect hex strings:

    printf '{HEX STRING}' | xxd -r -p | xxd
Make sure to omit the leading `\x`. e.g.,

    > printf '0a00000029416e746967756120616e64204261726275646107414709415447093032381d49534f20333136362d323a414713416d657269636173414c6174696e20416d657269636120616e64207468652043617269626265616e1543617269626265616e093031390934313909303239' | xxd -r -p | xxd
    00000000: 0a00 0000 2941 6e74 6967 7561 2061 6e64  ....)Antigua and
    00000010: 2042 6172 6275 6461 0741 4709 4154 4709   Barbuda.AG.ATG.
    00000020: 3032 381d 4953 4f20 3331 3636 2d32 3a41  028.ISO 3166-2:A
    00000030: 4713 416d 6572 6963 6173 414c 6174 696e  G.AmericasALatin
    00000040: 2041 6d65 7269 6361 2061 6e64 2074 6865   America and the
    00000050: 2043 6172 6962 6265 616e 1543 6172 6962   Caribbean.Carib
    00000060: 6265 616e 0930 3139 0934 3139 0930 3239  bean.019.419.029
By @jayde2767 - 4 months
Just curious if anyone else encountered this same error from the initial "docker run" command:

docker: Error response from daemon: create ./pg-data: "./pg-data" includes invalid characters for a local volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass a host directory, use absolute path.

By @nemothekid - 4 months
When I started my dev career, NoSQL was the rage and I remember reading about BigTable, Cassandra, Dynamo, and most importantly LSMs. They made a big deal about how the data on stored on disk was sorted. I never knew why this was a big deal but always kept it in mind, but I never bothered to understand how it was done previously.

>Something really important about tables which isn’t obvious at first is that, even though they might have sequential primary keys, tables are not ordered.

This was very surprising to read.

By @nraynaud - 4 months
A bit of curiosity: how did Postgres choose 8k pages? shouldn’t it be the FS page size to help with atomicity?
By @thomasben - 4 months
Easy to read, well written and well structured. Great work Drew !
By @shreddit - 4 months
Why does closing the table of contents open the nav menu on mobile?
By @topherjaynes - 4 months
Honestly, I couldn't click this link quick enough and it delivered. Looking forward to the "oversized values" follow-up!
By @HPsquared - 4 months
Title missing the leading "How"
By @mharig - 4 months
> Can’t we just store some data on disk and read / write from it when we need to? (Spoiler: no.)

I disagree. SQLite does a good job in uniting the 2 worlds: complex SQL queries with excellent data consistency and simple file(s). Although SQLite is for sure not the one size fits all solution.

By @llimllib - 4 months
@drewsberry: I wish you had an RSS feed! I tried to subscribe to your blog but if there is one it's not linked.

(Enjoyed the post)