July 1st, 2024

A write-ahead log is not a universal part of durability

A write-ahead log (WAL) isn't always essential for database durability. Techniques such as fsync, group commit, and checksumming can each provide or strengthen durability on their own; still, a WAL remains a cost-effective way to achieve durability in most cases, which makes the trade-offs worth understanding for database operators.

A write-ahead log (WAL) is not strictly necessary for durability in databases, although it is common practice. Durability means data is safely on disk before the database confirms a client's write. Without a WAL, a database can still achieve a basic form of durability by writing its data files directly to disk, though without flushing this is not crash-safe; databases therefore use fsync (or an equivalent) to force data out of operating-system caches onto stable storage. MongoDB, for example, defaults to committing writes on an interval, trading a small window of potential data loss for performance. Group commit amortizes fsync costs by batching multiple write operations under a single flush. Additionally, checksumming data on write, together with background processes like scrubbing, helps detect and recover from silent corruption. Some database features, such as ACID transactions and logical replication, do effectively require a WAL, and a WAL remains a cost-effective way to provide durability in most cases. Understanding this spectrum of durability, and the mechanisms behind each point on it, is crucial for database administrators.
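To make the group-commit idea above concrete, here is a minimal Python sketch (the class name and structure are illustrative, not from the article): writers append records under a lock, and a single fsync can make records appended by several threads durable at once.

```python
import os
import threading

class GroupCommitLog:
    """Append-only log with group commit: one fsync can cover
    records appended by several racing threads at once."""

    def __init__(self, path: str):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.lock = threading.Lock()
        self.written = 0  # bytes appended so far
        self.durable = 0  # bytes covered by the last fsync

    def append(self, record: bytes) -> None:
        with self.lock:
            os.write(self.fd, record)
            self.written += len(record)
            target = self.written
        # Drop the lock so other writers can append, then flush only
        # if a racing thread's fsync has not already covered this record.
        with self.lock:
            if self.durable < target:
                os.fsync(self.fd)  # makes everything appended so far durable
                self.durable = self.written
```

A production design would elect one flusher per batch and let the other writers sleep until its fsync returns; the point here is only that a single flush can commit many records.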

Related

Schema changes and the Postgres lock queue

Schema changes in Postgres can cause downtime due to locking issues. Tools like pgroll help manage migrations by handling lock acquisition failures, preventing application unavailability. Setting lock_timeout on DDL statements is crucial for smooth schema changes.

Postgres Schema Changes and Locking

Schema changes in Postgres can cause downtime by locking out reads and writes, and migration tools help mitigate the problem. Migrations can break things two ways: by changing the schema in ways client apps don't expect, or by locking database objects and making them unavailable. A long-running query can force a DDL statement to wait for a lock, and everything behind the DDL then queues up; setting lock_timeout on DDL statements prevents this queuing. Tools like pgroll add backoff-and-retry strategies for lock acquisition failures. Understanding how DDL interacts with locking makes for smoother migrations and less downtime.

SquirrelFS: Using the Rust compiler to check file-system crash consistency

The paper introduces SquirrelFS, a crash-safe file system using Rust's typestate pattern for compile-time operation order enforcement. Synchronous Soft Updates ensure crash safety by maintaining metadata update order. SquirrelFS offers correctness guarantees without separate proofs, quickly verifying crash consistency during compilation. Comparative evaluations show SquirrelFS performs similarly or better than NOVA and WineFS.

Resilient Sync for Local First

The "Local-First" concept emphasizes empowering users with data on their devices, using Resilient Sync for offline and online data exchange. It ensures consistency, security, and efficient synchronization, distinguishing content changes and optimizing processes. The method offers flexibility, conflict-free updates, and compliance documentation, with potential enhancements for data size, compression, and security.

Structured logs are the way to start

Structured logs are crucial for system insight, aiding search and aggregation. Despite storage challenges, prioritizing indexing and retention strategies is key. Valuable lessons can be gleaned from email for software processes.

9 comments
By @pjdesno - 4 months
Unfortunately this post skips over the "atomicity" part of a write-ahead log.

Assume you start with data on disk AAAAAAAA, read it into memory, and update it to BBBBBBBB, then write it back. If you crash in the middle, you might end up with BBBAAAAA, BBBBBBAA, or even some crazy interleaving. (at least for reasonable file sizes - note that the largest atomic write to many NVMe drives is 128K)

If you ditch the in-memory BTree and write a direct-to-disk one, with a lot of care (and maybe a bit of copy-on-write) you can make sure that each disk write leaves the database in a crash-consistent state, but that will cost multiple writes and fsyncs for any database modifications that split or merge BTree nodes - you have to ensure that each write leaves the database in a consistent state.
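At file granularity, the "bit of copy-on-write" the comment mentions is often done with the classic write-new-then-rename pattern, which leans on rename() replacing its target atomically on POSIX filesystems. A sketch (function name mine): a crash leaves either the old or the new file contents, never the BBBAAAAA interleavings described above.

```python
import os

def atomic_overwrite(path: str, data: bytes) -> None:
    """Replace `path` with `data` crash-safely: write a new copy,
    flush it, then atomically swap it into place with rename."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)       # new contents are durable before the swap
    finally:
        os.close(fd)
    os.replace(tmp, path)  # atomic on POSIX filesystems
    # fsync the directory so the rename itself survives a crash
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

This only works for whole-file replacement; it is exactly the "multiple writes and fsyncs per modification" cost the comment is describing, which is what a log amortizes away.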

(for those of you old enough to remember ext2, it had the same problem. If you mounted it async and had a bad crash, the data on disk would be inconsistent - you'd lose data, so you'd vow to always mount your filesystem with synchronous writes so you'd never lose data again, then a few weeks later you'd get tired of the crappy performance and go back to async writes, until the next crash happened, etc. etc.)

The advantage of a log is that it allows you to combine multiple writes to different parts of the database file into a single record, guaranteeing (after crash recovery if necessary) that either all changes happen or none of them do. It serves the same purpose as a mutex in multi-threaded code - if your invariants hold when you get the mutex, and you reestablish them before you drop it, everything will be fine. We'd all love to have a mutex that keeps the system from crashing, but failing that we can use a WAL record to ensure that we move atomically from one valid state to another, without worrying about the order of intermediate changes to the data structure.
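A minimal sketch of that all-or-nothing property (the record format is illustrative, not any particular database's): each record carries a length and a checksum, one payload can bundle edits to many pages, and recovery replays only complete, intact records, so a crash mid-append simply truncates the log at a record boundary.

```python
import os
import struct
import zlib

def wal_append(fd: int, changes: bytes) -> None:
    """Append one atomic record: [length][crc32][payload].
    The payload may bundle edits to many parts of the database;
    recovery applies it entirely or not at all."""
    header = struct.pack("<II", len(changes), zlib.crc32(changes))
    os.write(fd, header + changes)
    os.fsync(fd)  # the flushed record is the commit point

def wal_replay(path: str):
    """Yield every complete, intact record in order; stop at the
    first short or corrupt one (a torn tail from a crash)."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                return
            length, crc = struct.unpack("<II", header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                return
            yield payload
```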

By @eatonphil - 4 months
One of the interesting things I came up against while writing this post was the pretty common misconception that SQLite and Postgres will validate your data with checksums [0]. SQLite leaves this to an optional extension (this is a little more commonly known). Here was an HN comment I stumbled on about Postgres [1]:

> Postgres (and other databases that actually care about data integrity) have per page checksums:

Postgres does have support for data checksumming (but not metadata checksumming) but defaults to disabling data checksumming. MongoDB (WiredTiger) on the other hand defaults to checksumming. I was told after publishing this post that MySQL (InnoDB) also does checksum by default but I did not check that in my survey.
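For anyone unfamiliar with what per-page checksumming looks like mechanically, a sketch (CRC-32 and the page layout here are stand-ins for illustration; each engine uses its own algorithm and header format): the checksum is computed when the page is written and re-verified when it is read back.

```python
import struct
import zlib

PAGE_SIZE = 8192  # data bytes per page, excluding the 4-byte checksum

def seal_page(data: bytes) -> bytes:
    """Prefix a page with a checksum computed at write time."""
    assert len(data) == PAGE_SIZE
    return struct.pack("<I", zlib.crc32(data)) + data

def verify_page(page: bytes) -> bytes:
    """On read, recompute and compare; a mismatch signals bit rot
    or a torn write rather than silently returning bad data."""
    (stored,) = struct.unpack("<I", page[:4])
    data = page[4:]
    if zlib.crc32(data) != stored:
        raise IOError("page checksum mismatch: corrupt or torn page")
    return data
```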

Links for proof are in the post.

[0] https://x.com/eatonphil/status/1807572135340134687

[1] https://news.ycombinator.com/item?id=25231308

By @zellyn - 4 months
I found the Developer Voices discussion[1][2] with Joran Dirk Greef, creator of TigerBeetle, to be fascinating. They mentioned that the rigorous correctness proofs that exist for Raft and Paxos assume an absence of disk faults, but that more modern theory includes fixes for that, as long as you marry the log writing and consensus algorithms together properly rather than keeping each as a separate black box.

[1] https://podcasts.apple.com/us/podcast/databases-ambitions-an...

[2] https://www.youtube.com/watch?v=ayG7ltGRRHs

By @AdamProut - 4 months
Maybe good to mention torn pages somewhere too? Both MySQL and Postgres jump through some hoops to both detect and repair them [1][2]. So even in the scenario in the post where fsync is used to harden writes, the database still needs to handle torn pages (or it requires a file system / storage layer that guarantees atomic page writes at the page size the database is using, as several managed/cloud databases do).

[1] https://wiki.postgresql.org/wiki/Full_page_writes [2] https://dev.mysql.com/doc/refman/8.0/en/innodb-doublewrite-b...
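A drastically simplified single-page sketch of the doublewrite idea from [2] (real implementations batch pages and run a recovery pass; the names here are illustrative): the page is first made durable in a scratch area, so if the in-place write tears, an intact copy survives for recovery to restore.

```python
import os

def doublewrite(dw_fd: int, data_fd: int, offset: int, page: bytes) -> None:
    """Crash-safe in-place page write via a doublewrite area."""
    os.pwrite(dw_fd, page, 0)         # 1. stage the page in the doublewrite area
    os.fsync(dw_fd)                   #    staged copy is durable first
    os.pwrite(data_fd, page, offset)  # 2. only then overwrite in place
    os.fsync(data_fd)                 # if this write tore, recovery restores
                                      # the page from the doublewrite area
```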

By @londons_explore - 4 months
Worth noting that there are various tricks to combine the write ahead log and the rest of the on-disk data, meaning the vast majority of data will not be written more than once, and you'll still get the durability and data locality benefits of a WAL.
By @beltsazar - 4 months
But the WAL is not for durability; it is for atomicity and consistency. And yes, you need to use fsync to ensure durability.
By @refset - 4 months
> durability is a spectrum

This is the truth. My favourite example of this is MemoryDB's "multi-AZ (multi-datacenter) durability" for Valkey - there's a good write-up here https://brooker.co.za/blog/2024/04/25/memorydb.html

By @sgeisenh - 4 months
In the pseudocode snippet that introduces a WAL, the semaphores are signaled before the fsync occurs. This seems like a mistake but maybe there’s a good reason to do this that I am missing?

Edit: it looks like this is the case in the first batched fsync example, as well.
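For what it's worth, the safe ordering the commenter is implying (sketched below; this is not the post's actual pseudocode) is to signal waiters only after fsync returns, since acknowledging before the flush opens a window where a crash loses a write the client was told succeeded.

```python
import os
import threading

def commit(fd: int, record: bytes, acked: threading.Event) -> None:
    """Durably commit one record before acknowledging it."""
    os.write(fd, record)
    os.fsync(fd)  # the record is on stable storage...
    acked.set()   # ...only then signal the waiting client
```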

By @jmix - 4 months
This is obvious if you took a DB course, and if you didn't, you have no business building a DB. Sadly, all the NoSQL junkware was built by people who didn't.