Building a highly-available web service without a database
A new architecture enables web services to use RAM as a primary data store, enhancing availability with the Raft Consensus algorithm, periodic snapshots, and sharding for efficient scaling.
The blog post discusses a novel architecture for building highly-available web services without relying on traditional databases. The author, Arnold Noronha, highlights the evolution of technology over the past decade, including faster and more robust storage solutions, cheaper RAM, and the introduction of the Raft Consensus algorithm. This architecture allows developers to treat RAM as the primary data store, eliminating the need for complex database interactions and enabling simpler debugging and concurrency management. The process involves taking periodic snapshots of RAM and logging transactions to disk, ensuring data recovery in case of crashes. As startups grow and require higher availability, the Raft protocol can replicate the service across multiple machines, allowing for seamless failover and rolling deployments. The author also mentions the importance of sharding for scaling and shares insights into the technology stack used at Screenshotbot, which includes Common Lisp and custom libraries for managing data and server processes. The architecture is presented as a viable option for startups looking to innovate and scale efficiently.
- The architecture allows treating RAM as a database, simplifying web service development.
- The Raft Consensus algorithm enhances availability by enabling service replication across multiple machines.
- Periodic snapshots and transaction logging ensure data recovery after crashes.
- Sharding is suggested for scaling as customer demands increase.
- The technology stack includes Common Lisp and custom libraries tailored for high concurrency and availability.
Related
Synchronization Is Bad for Scale
Lock contention makes synchronization a bottleneck when scaling distributed systems, discouraging lock use in high-concurrency settings. Alternatives like sharding, consistent hashing, and the Saga Pattern are suggested. Examples from Mailgun's MongoDB use highlight strategies for avoiding lock contention and scaling effectively, cautioning against excessive reliance on the database.
Building SaaS from Scratch Using Cloud-Native Patterns
Building a robust Cloud platform for SaaS involves utilizing cloud-native patterns like Google Cloud Platform, Azure, and AWS for self-service, scalability, and multi-tenancy. Diagrid's case study emphasizes control plane design, API strategies, and tools like Kubernetes Resource Model for effective resource management. Joni Collinge advises caution in adopting Cloud-Native technologies to align with specific requirements, ensuring adaptability in a competitive landscape.
Debugging an evil Go runtime bug: From heat guns to kernel compiler flags
Crashes in node_exporter on a laptop were traced to a single bad RAM bit, underscoring the importance of ECC RAM for server reliability. The bad RAM block was masked off using a GRUB 2 feature, and the RAM was heated with a heat gun to test its behavior under stress.
Understanding Performance Implications of Storage-Disaggregated Databases
Storage-compute disaggregation in databases is gaining traction among major companies. A study presented at SIGMOD 2024 revealed its performance impacts, emphasizing the need for buffering and for addressing write-throughput inefficiencies.
- Many commenters express skepticism about the complexity and potential pitfalls of building a custom in-memory database, suggesting that established solutions like SQL databases are more reliable and easier to manage.
- Several users highlight the challenges of maintaining data consistency, durability, and handling failures, arguing that the proposed architecture may lead to increased technical debt.
- Some commenters appreciate the innovative approach and the desire to experiment with new technologies, but caution against reinventing the wheel when robust solutions already exist.
- There is a recurring theme of concern regarding the long-term viability and scalability of the architecture, especially for production environments.
- Many suggest that using existing databases like SQLite or Redis would provide a simpler and more effective solution for most applications.
The in-memory state can be whatever you want, which means you can build up your own application-specific indexing and querying functions. You could just use sqlite with :memory: for the Raft FSM, but if you can build/find an in-memory transaction store (we use our own go-memdb), then reading from the state is just function calls. Protecting yourself from stale reads or write skew is trivial; every object you write has a Raft index so you can write APIs like "query a follower for object foo and wait till it's at least at index 123". It sweeps away a lot of "magic" that normally you'd shove into a RDBMS or other external store.
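A minimal sketch of that read-after-index API, with invented names (illustrative only, not the commenter's actual go-memdb setup): the FSM bumps an applied-index counter on every committed entry, and a follower read can block until it has caught up to the index the client saw when it wrote.

```go
package store

import "sync"

type Object struct {
	Data      []byte
	RaftIndex uint64 // log index of the entry that last wrote this object
}

type Store struct {
	mu           sync.Mutex
	cond         *sync.Cond
	appliedIndex uint64
	objects      map[string]Object
}

func New() *Store {
	s := &Store{objects: make(map[string]Object)}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// Apply is what the Raft FSM calls for each committed log entry.
func (s *Store) Apply(index uint64, key string, data []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.objects[key] = Object{Data: data, RaftIndex: index}
	s.appliedIndex = index
	s.cond.Broadcast() // wake any readers waiting on this index
}

// GetAtLeast implements "query a follower for key and wait till it's
// at least at index minIndex", which rules out stale reads after a write.
func (s *Store) GetAtLeast(key string, minIndex uint64) (Object, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.appliedIndex < minIndex {
		s.cond.Wait()
	}
	o, ok := s.objects[key]
	return o, ok
}
```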
That being said, I'd be hesitant to pick this kind of architecture for a new startup outside of the "infrastructure" space... you are effectively building your own database here though. You need to pick (or write) good primitives for things like your inter-node RPC, on-disk persistence, in-memory transactional state store, etc. Upgrades are especially challenging, because the new code can try to write entities to the Raft log that nodes still on the previous version don't understand (or worse, misunderstand because the way they're handled has changed!). There's no free lunch.
That's no longer true, with modern desktop and mobile apps often using a database (usually SQLite) because relational data storage and queries turn out to be pretty useful in a wide range of applications.
Worse is better until you absolutely need to be less worse, then you'll know for sure. At that point you'll know your pain points and can address them more wisely than building more up front.
If your load fits entirely on one server, then just run the database on that damn server and forget about “special architectures to reduce round-trips to your database”. If your data fits entirely in RAM, then use a ramdisk for the database if you want, and replicate it to permanent storage with standard tools. Now that’s actually simple.
The premises are weak and the claims absurd. The author uses overstatement of the difficulties of serialization just to make their weak claim stronger.
I usually start by putting those items into YAML files in a "data" directory. Actually a custom YAML dialect without the quirks of the original. Each value is a string. No magic type conversions. Creating a new item is just "vim crunches.yaml" and putting the data in. Editing, deleting etc all is just wonderfully easy with this data structure.
Then when the project grows, I usually create a DB schema and move the items into MariaDB or SQLite.
This time, I think I will move the items (exercises) into a JSON column of an SQLite DB. All attributes of an item will be stored in a single JSON field. And then write a little DB explorer which lets me edit JSON fields as YAML. So I keep the convenience of editing human readable data.
Writing the DB explorer should be rather straight forward. A bit of ncurses to browse through tables, select one, browse through rows, insert and delete rows. And for editing a field, it will fire up Vim. And if the field is a JSON field, it converts it to YAML before it sends it to Vim and back to JSON when the user quits Vim.
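A sketch of the JSON-to-YAML round trip that explorer would need, assuming gopkg.in/yaml.v3 (error handling trimmed for brevity):

```go
package main

import (
	"encoding/json"
	"fmt"

	"gopkg.in/yaml.v3"
)

// jsonToYAML converts a JSON column value into YAML for hand-editing.
func jsonToYAML(j []byte) ([]byte, error) {
	var v interface{}
	if err := json.Unmarshal(j, &v); err != nil {
		return nil, err
	}
	return yaml.Marshal(v)
}

// yamlToJSON converts the edited YAML back for storage in the DB.
func yamlToJSON(y []byte) ([]byte, error) {
	var v interface{}
	if err := yaml.Unmarshal(y, &v); err != nil {
		return nil, err
	}
	return json.Marshal(v)
}

func main() {
	row := []byte(`{"name":"crunches","sets":3,"reps":20}`)
	y, _ := jsonToYAML(row)
	fmt.Println(string(y)) // hand this to Vim, then convert back
	j, _ := yamlToJSON(y)
	fmt.Println(string(j))
}
```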
Now, with cheap hardware, they’re going back in time to the benefits of clustered, NUMA machines. They’ve improved on it along the way. I did enjoy the article.
Another trick from the past was eliminating TCP/IP stacks from within clusters to knock out their issues. Solutions like Active Messages were a thin layer on top of the hardware. There are also designs for network routers that have strong consistency built into them. Quite a few things they could do.
If they get big, there are hardware opportunities. On the CPU side, SGI did two things. Their NUMA machines expanded the number of CPUs and the amount of RAM available to one system. They also allowed FPGAs to plug directly into the memory bus to act as custom accelerators. Finally, some CompSci papers modified processor ISAs, networks on a chip, etc. to remove or reduce bottlenecks in multithreading. Also, chips like OpenPiton increase core counts (e.g. 32) with open, customizable cores.
This exists in sufficiently mature Actor model[0] implementations, such as Akka Event Sourcing[1], which also addresses:
> But then comes the important part: how do you recover when your process crashes? It turns out that answer is easy, periodically just take a snapshot of everything in RAM.
Intrinsically and without having to create "a new architecture for web development". There are even open source efforts which explore the Raft protocol using actors here[2] and here[3].
0 - https://en.wikipedia.org/wiki/History_of_the_Actor_model
1 - https://doc.akka.io/docs/akka/current/typed/persistence.html
But no, just more lispers.
I think this has to be the number one misunderstanding for developers.
Yes, SSDs in terms of throughput or IOPS have gone up by 100 to 10,000x. vCPU performance per dollar has gone up 20 to 50x. We went from 45/32nm to today's 5nm/3nm, with much higher IPC.
But RAM prices haven't fallen anywhere near as far as CPU or SSD prices. RAM may have gotten a lot faster, and you can now stick in a lot more of it, with higher-density chips and channel counts going from dual to 8 or 12. But if you look at DRAM spot prices from 2008 to 2022, the price bottomed out at around $2.8/GB three separate times, cycling up to $8 or $6 per GB in between. That is, had you bought DRAM at its lowest point, or at its highest point, during the past ~15 years, it would have cost roughly the same, plus or minus 10-20%, ignoring inflation.
It wasn't until mid-2022 that prices finally broke through the $2.8/GB barrier and collapsed to close to $1/GB, before settling at around $2/GB for DDR5.
Yes, you can now get 4TB of RAM in a server. But that doesn't mean DRAM is super cheap. Developers on average, and especially those in big tech, now earn far more than they did in 2010, which makes them think RAM has gotten a lot more affordable. In reality, even at the lowest point of the past 15 years, you got at best slightly more than a 2x reduction in DRAM price. And we will likely see DRAM prices shoot up again in a year or two.
https://emoji.boats/ is the most public facing of these.
I also have built a whole class of micro-services that pull their entire dataset from an API on startup, hold it resident, and update it on occasion. These have been amazing for speeding up certain classes of lookup for us where we don't always need entirely up-to-date data.
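A minimal sketch of that pattern (assuming Go 1.19+ for atomic.Pointer; fetchAll stands in for whatever API call loads the dataset): readers are lock-free, and the background refresh swaps the whole dataset atomically.

```go
package cache

import (
	"sync/atomic"
	"time"
)

type Dataset map[string]string

type Cache struct {
	current atomic.Pointer[Dataset]
}

// New loads the full dataset once at startup, then refreshes it in the
// background. Stale reads between refreshes are acceptable by design.
func New(fetchAll func() (Dataset, error), every time.Duration) (*Cache, error) {
	c := &Cache{}
	ds, err := fetchAll()
	if err != nil {
		return nil, err
	}
	c.current.Store(&ds)
	go func() {
		for range time.Tick(every) { // sketch: ticker never stopped
			if ds, err := fetchAll(); err == nil {
				c.current.Store(&ds) // atomic swap; readers never block
			}
		}
	}()
	return c, nil
}

func (c *Cache) Lookup(key string) (string, bool) {
	v, ok := (*c.current.Load())[key]
	return v, ok
}
```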
When I asked people working on it if they had considered Redis or Mongo or Postgres with jsonb columns, they said they had considered all of those things but decided to roll their own DB anyway because "they understood it better".
This article gives off the same energy. I really hope it works out for you, but IMO spending innovation tokens to build a database is nuts.
Reading posts like this makes me think the founders/CTO is mixing hobby programming with professional programming.
https://youmightnotneedjquery.com/
The overwhelming majority underestimates the beauty, effort, and experience that go into abstractions. There are some true geniuses doing fantastic work to deliver syntactic sugar, while critics mock the somewhat larger bundle size as "a couple of lines frequently used." That's why.
In the end, a good framework is more than just an abstraction. It guarantees consistency and accessibility.
My advice: try to understand the source code, if possible, before reinventing the wheel.
What starts out as fun quickly becomes a burden. If there weren't any edge cases or different conditions, you wouldn't need an abstraction. Been there, done that.
I haven't tried it out, but just thinking of how many fewer organizational hoops I would have to jump through makes me want to try it:
- No ordering a database from the database operations team.
- No ordering a port opening from the network operations team.
- No ordering of certificates.
- The above times 3 for development, test and production.
- Not having to run database containers during development.
I think the sweet spot for me would be in services that I don't expect to grow beyond a single node and there is an acceptance for a small amount of downtime during service windows.
Sure you reduce deployment complexity, but what about maintaining your algorithm that implements data persistence and replication?
To assume that will never spectacularly bite you is naive. Tests also only go as far as what you know to test for. And while you don't know if your product will ever be used, you also don't know if it will explode in success, leaving you hostage to your own decisions and technical debt.
These are HARD decisions. Hard decisions require solid solutions. You can surely try that with toy projects, but if I was in a position to build a software architecture for something that had a remote possibility of being used in production, I would oppose such designs adamantly.
1. I take it you've seen the LMAX talk [0], and were similarly inspired? :)
2. Are you familiar with the event sourcing approach? It's basically what you describe, except you don't flush to disk after editing every field, you batch your updates into a single "event". (you've come at it from the exact opposite end, but it looks like roughly the same thing).
1. Create your own in-memory database.
2. Make sure every transaction in this DB can be serialized and is simultaneously written to disk.
3. Use some orchestration platform to make all web servers aware of each other.
4. Synchronize transaction logs between your web servers (by implementing the Raft protocol) and update the in-memory DB.
5. Write some kind of conflict resolution algorithm, because there's no way to implement locking or enforce consistency/isolation in your DB.
6. Shard your web servers by tenant and write another load balancing layer to make sure that requests are getting to the server their data is on.
Simple indeed.
But I have some bad news: you haven’t built a system without a database, you’ve just built your own database, one without transactions and with weak durability properties.
> Hold on, what if you’ve made changes since the last snapshot? And this is the clever bit: you ensure that every time you change parts of RAM, we write a transaction to disk.
This is actually not an easy thing to do. If your shutdowns are always clean SIGTERMs, yes, you can reliably flush writes to disk. But if you get a SIGKILL at the wrong time, or don’t handle an I/O error correctly, you’re probably going to lose data. (Postgres’ 20-year fsync issue was one of these: https://archive.fosdem.org/2019/schedule/event/postgresql_fs...)
The open secret in database land is that for all we talk about transactional guarantees and durability, the reality is that those properties only start to show up in the very, very, _very_ long tail of edge cases, many of which are easily remedied by some combination of humans getting paged and end users developing workarounds (e.g. double-entry bookkeeping). This is why MySQL’s default isolation level can lose writes: there are usually enough safeguards in any given system that it doesn’t matter.
A lot of what you’re describing as “database issues” don’t sound to me like DB issues so much as latency issues caused by not colocating your service with your DB. By hand-rolling a DB implementation using Raft, you’ve also colocated storage with your service.
> Screenshotbot runs on their CI, so we get API requests 100s of times for every single commit and Pull Request.
I’m sorry, but I don’t think this was as persuasive as you meant it to be. To be snarky about it, this is the type of workload I could run off my phone[0].
It claims to solve a bunch of problems by ignoring them. There are solid reasons why people distribute their applications across multiple machines. After reading this article I feel like we need to state a bunch of them.
Redundancy - what if one machine breaks, whether through a hardware failure, a software failure, or a network failure (a network partition where you can’t reach the machine, or it can’t reach the internet)?
Scaling - what if you can’t serve all of your customers from one machine? Perhaps you have many customers and a small app, or perhaps your app uses a lot of resources (maybe it loads gigs of data).
Deployment - what happens when you want to change the code without going down? If you are running multiple copies of your app, you get this for cheap.
There are tons of smaller benefits, like right-sizing your architecture. What if the one machine you chose is not big enough and you need to move to a new machine? With multiple machines, you just increase the number of machines. You also get to use a variety of machine sizes and can choose ones that fit your needs, and this flexibility lets you pick cheaper machines.
I feel like the authors don’t know why people invented the standard way of doing things.
> And so, if your process crashes and restarts, it first reloads the snapshot, and replays the transaction logs to fully recover the state. (Notice that index changes don’t need to be part of the transaction log. For instance if there’s an index on field bar from Foo, then setBar should just update the index, which will get updated whether it’s read from a snapshot, or from a transaction.)
That’s a database. You even linked to the specific database you’re using [0], which describes itself as:
> […] in-memory database with transactions […]
Am I misunderstanding something?
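For reference, the snapshot-plus-replay recovery the quote describes fits in a page of code. This is an illustrative sketch with invented, simplified types (not the linked library's actual API), assuming the log is rotated whenever a snapshot is taken:

```go
package persist

import (
	"encoding/gob"
	"io"
	"os"
)

type State struct {
	Objects map[string]map[string]int64 // e.g. Objects["foo-1"]["bar"] = 2
}

// Txn is one log entry: "field F of object O became V".
type Txn struct {
	Object, Field string
	Value         int64
}

func (s *State) apply(t Txn) {
	obj, ok := s.Objects[t.Object]
	if !ok {
		obj = map[string]int64{}
		s.Objects[t.Object] = obj
	}
	obj[t.Field] = t.Value
	// In-memory indexes would be updated here, as a side effect of the
	// write, which is why they never need to appear in the log itself.
}

// Recover rebuilds the in-RAM state: reload the latest snapshot, then
// replay every transaction logged after it.
func Recover(snapshotPath, logPath string) (*State, error) {
	s := &State{Objects: map[string]map[string]int64{}}

	if f, err := os.Open(snapshotPath); err == nil {
		if err := gob.NewDecoder(f).Decode(s); err != nil {
			f.Close()
			return nil, err
		}
		f.Close()
	} // no snapshot yet: start empty and rely on the log alone

	f, err := os.Open(logPath)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	dec := gob.NewDecoder(f)
	for {
		var t Txn
		if err := dec.Decode(&t); err == io.EOF {
			break
		} else if err != nil {
			return nil, err
		}
		s.apply(t)
	}
	return s, nil
}
```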
I'm thrilled to see someone try something different, and grateful that he wrote about his positive experiences with it. Perhaps it will turn out to have been the wrong decision, but his writing about it is the only way we'll ever really know. It's so easy to be lulled into a sense of security by doing things the conventional way, and to miss opportunities offered by huge improvements in hardware, well-written open-source libraries, and powerful programming languages.
I have an especially hard time with the idea that SQL is where we all should end up. I've worked at Oracle, and I worked on Google AdWords when it was built on MySQL and InnoDB. I understand SQL's power, but I also understand how constraining it is, not only on data representation, but also on querying. I want to read more posts about people trying to build something without it. Redis is one way, but I'm eager to hear about others.
I wish the author good luck, and encourage him to write another post once Screenshotbot reaches the next stage.
Sounds similar to `stop the world` garbage collection in Java. Does your entire process come to a halt when you do this? How frequently do you need to take snapshots? Or do you have a way to do this without halting everything?
Every single time… it’s always just the wheel being re-written.
2. Are network requests / other ephemeral things also saved to the snapshot?
No transactions, no WAL, no relational schema to keep data design sane, no query planner doing all kinds of optimisations and memory layout things I don't have to think about?
You could say that transactions, for example, would be redundant if there is no external communication between the app server and the database. But that is far from the only thing they're useful for. Transactions are a great way of enforcing important invariants about the data, just like a good strict database schema. You roll back a transaction if an internal error throws. You make sure that a transaction's data changes get serialised to disk all at once. You remove the possibility that statements from two simultaneous transactions access the same data in an arbitrary order (at least if you pick a proper transaction isolation level, which you usually should).
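As a reminder of how little code those guarantees cost with an RDBMS, here is a sketch in stock database/sql (table and column names invented for the example): if anything fails midway, Rollback undoes the partial writes.

```go
package billing

import "database/sql"

// Transfer moves cents between accounts; either both updates commit
// or neither does, so the balance invariant is preserved.
func Transfer(db *sql.DB, from, to string, cents int64) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if Commit succeeds

	if _, err := tx.Exec(
		"UPDATE accounts SET balance = balance - ? WHERE id = ?", cents, from); err != nil {
		return err
	}
	if _, err := tx.Exec(
		"UPDATE accounts SET balance = balance + ? WHERE id = ?", cents, to); err != nil {
		return err
	}
	return tx.Commit() // both writes become durable atomically
}
```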
> You also won’t need special architectures to reduce round-trips to your database. In particular, you won’t need any of that Async-IO business, because your threads are no longer IO bound. Retrieving data is just a matter of reading RAM. Suddenly debugging code has become a lot easier too.
The database is far from the only other server I have to communicate with when handling a user's HTTP request. As a web developer, I don't think I've worked on a single product in the last 4 years that didn't have some kind of server-to-server communication for integrations with other tools and social media sites.
> You don’t need crazy concurrency protocols, because most of your concurrency requirements can be satisfied with simple in-memory mutexes and condition variables.
Ah, mutexes. Something programmers have never shot themselves in the foot with. Also, deadlocks don't exist.
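For anyone who hasn't hit it, the classic footgun being alluded to takes about ten lines to reproduce (an illustrative sketch): two goroutines taking the same two mutexes in opposite orders.

```go
package main

import "sync"

var accounts, audit sync.Mutex

func handlerA() {
	accounts.Lock()
	defer accounts.Unlock()
	audit.Lock() // A holds accounts and waits for audit...
	defer audit.Unlock()
}

func handlerB() {
	audit.Lock()
	defer audit.Unlock()
	accounts.Lock() // ...while B holds audit and waits for accounts
	defer accounts.Unlock()
}

func main() {
	go handlerA()
	go handlerB()
	select {} // with unlucky timing, both goroutines block forever
}
```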
> Hold on, what if you’ve made changes since the last snapshot? And this is the clever bit: you ensure that every time you change parts of RAM, we write a transaction to disk. So if you have a line like foo.setBar(2), this will first write a transaction that says we’ve changed the bar field of foo to 2, and then actually set the field to 2. An operation like new Foo() writes a transaction to disk to say that a Foo object was created, and then returns the new object.
So a disk write's latency is added to every RAM write. Surely that has no performance cost, and nobody will notice.
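For concreteness, here is roughly the log-then-mutate write path the quoted passage describes, as a sketch with invented names (not the article's implementation). The Sync call is both where the durability comes from and where the per-write latency lives:

```go
package txlog

import (
	"encoding/json"
	"os"
	"sync"
)

type Entry struct {
	Object, Field string
	Value         int64
}

type Log struct {
	mu sync.Mutex
	f  *os.File
}

func Open(path string) (*Log, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &Log{f: f}, nil
}

// Append durably records the change before the caller mutates RAM.
func (l *Log) Append(e Entry) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	b, err := json.Marshal(e)
	if err != nil {
		return err
	}
	if _, err := l.f.Write(append(b, '\n')); err != nil {
		return err
	}
	// Ignoring this error is exactly the fsync trap mentioned upthread.
	return l.f.Sync()
}

type Foo struct {
	ID  string
	Bar int64
}

// SetBar is the article's foo.setBar(2): log first, then touch memory.
func SetBar(l *Log, f *Foo, v int64) error {
	if err := l.Append(Entry{Object: f.ID, Field: "bar", Value: v}); err != nil {
		return err
	}
	f.Bar = v
	return nil
}
```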
I apologise if this comes off too snarky. Despite all of the above, I really like this idea, and I'm already thinking of implementing it in a hobby project just to see how well it really works. I'm still not sure if it's practical, but I love the creative thinking behind it, and the fact that it actually helped them build a business.
> Imagine all the wonderful things you could build if you never had to serialize data into SQL queries.
You can do all those "wonderful things" with an RDBMS too, it's just an additional step.
> First, you don’t need multiple front-end servers talking to a single DB, just get a bigger server with more RAM and more CPU if you need it.
You don't "need" that with a single DB too, you can also get a bigger machine. Also, you can use SQLite and Litestream.
> What about indices? Well, you can use in-memory indices, effectively just hash-tables to lookup objects. You don’t need clever indices like B-tree that are optimized for disk latency.
RDBMS provide all kinds of indices. You don't need to build them in your code or re-invent clever solutions. They're all there, optimized and battle-tested over decades.
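For contrast, here is what the article's "in-memory indices, effectively just hash-tables" amount to when hand-rolled (a sketch with invented types): every setter has to keep every index in step by hand.

```go
package mem

type Foo struct {
	ID  string
	Bar int64
}

type FooTable struct {
	byID  map[string]*Foo
	byBar map[int64]map[string]*Foo // the "index on bar"
}

func NewFooTable() *FooTable {
	return &FooTable{
		byID:  map[string]*Foo{},
		byBar: map[int64]map[string]*Foo{},
	}
}

func (t *FooTable) Insert(f *Foo) {
	t.byID[f.ID] = f
	t.indexBar(f, f.Bar)
}

// SetBar is where hand-rolled indexing earns its keep: forget the
// bookkeeping in one setter and queries silently go stale.
func (t *FooTable) SetBar(f *Foo, v int64) {
	delete(t.byBar[f.Bar], f.ID) // drop from the old bucket
	f.Bar = v
	t.indexBar(f, v)
}

func (t *FooTable) indexBar(f *Foo, v int64) {
	bucket, ok := t.byBar[v]
	if !ok {
		bucket = map[string]*Foo{}
		t.byBar[v] = bucket
	}
	bucket[f.ID] = f
}

func (t *FooTable) FindByBar(v int64) []*Foo {
	out := make([]*Foo, 0, len(t.byBar[v]))
	for _, f := range t.byBar[v] {
		out = append(out, f)
	}
	return out
}
```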
> You also won’t need special architectures to reduce round-trips to your database.
You don't need "special architectures" at all. With the most simple setup you get thousands to requests per second and sub 5 ms latency. With SQLite even more. No need for async IO, threads scale well enough for most apps. Anyway, async is not a magical thing.
> You don’t need any services to run background jobs, because background jobs are just threads running in this large process.
How does this change when using an RDBMS?
> You don’t need crazy concurrency protocols, because most of your concurrency requirements can be satisfied with simple in-memory mutexes and condition variables.
I trust a proper proven implementation in SQLite or Postgres much more than "simple in-memory mutexes and condition variables". One reason why Rust is so popular is that it's an eye opener when the compiler shows you all your concurrency bugs you never thought you had in your code.
---------------------
RDBMS solve / support many important things the easy way:
- normalized data modelling by refs and joins
- querying, filtering and aggregating data
- concurrency
- storage
Re-inventing those is, most of the time, much harder, more error-prone, and more expensive.
---------------------
The simplest, easiest, and proven way is still to use an RDBMS. Start with SQLite and Litestream if you don't want to manage Postgres, which is a substantial effort, I admit. Or a cost factor, though something like Neon / Supabase / ... at small scale is much, much cheaper than the development cost of all the stuff above.
Otherwise I get the messaging that with edge deployments, the database is the bottleneck; you just need a one-stop shop to do edge functions + edge DB.