July 30th, 2024

Making Machines Move

Fly.io has introduced "Clone-O-Mat," a system for managing stateful applications that allows asynchronous cloning of storage volumes, reducing downtime and improving data integrity during migrations on its cloud platform.

Fly.io has developed a new system for managing stateful applications on its cloud platform, addressing challenges associated with draining workers that have attached storage volumes. Previously, the process of draining a worker involved either copying data from an original volume to a new one, which was time-consuming and risked data loss, or restoring from backups, which was inadequate due to potential downtime. The solution, termed "Clone-O-Mat," introduces an asynchronous cloning operation that allows for the creation of a new volume while the original remains operational. This method enables a new Fly Machine to be booted with the cloned volume, which initially contains empty blocks. As data is accessed, the system retrieves it from the original volume, a process known as "hydration."
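
To make the mechanism concrete, here is a minimal sketch of creating a dm-clone target from userspace. This is not Fly.io's code: the table format follows the kernel's dm-clone documentation, but the device paths, sizes, and volume name are invented for illustration.

    package main

    import (
        "fmt"
        "log"
        "os/exec"
    )

    func main() {
        const (
            sizeSectors   = 20971520 // 10 GiB volume, in 512-byte sectors (invented)
            regionSectors = 8        // hydrate in 4 KiB regions
        )

        // dm-clone table line, per the kernel docs:
        //   <start> <length> clone <metadata dev> <destination dev> <source dev> <region size>
        // Destination is the empty local device; source is the original volume
        // (e.g. exposed over the network). All three paths here are hypothetical.
        table := fmt.Sprintf("0 %d clone /dev/nvme0n1p1 /dev/nvme0n1p2 /dev/sdb %d",
            sizeSectors, regionSectors)

        out, err := exec.Command("dmsetup", "create", "cloned-vol", "--table", table).CombinedOutput()
        if err != nil {
            log.Fatalf("dmsetup create failed: %v: %s", err, out)
        }

        // /dev/mapper/cloned-vol is usable immediately: reads of regions that
        // have not been hydrated yet are redirected to the source device,
        // while background hydration copies the rest over.
        fmt.Println("clone target created; hydration runs in the background")
    }

The key property is the one described above: the cloned device is readable and writable the moment it exists, long before the background copy finishes.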

The implementation relies on existing Linux features, specifically the dm-clone functionality, which allows for efficient block-level cloning. The orchestration of this process is managed by a service called flyd, which coordinates the operations across the Fly.io infrastructure. The system has been operational for nearly a year, demonstrating reliability despite the inherent complexities of managing encrypted volumes and ensuring data integrity during migrations. Overall, this innovation significantly reduces downtime and enhances the user experience for applications running on Fly.io's platform.
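
Hydration progress is observable through the device-mapper status interface, which is presumably one way an orchestrator like flyd could tell when a migration has finished. A sketch, continuing the hypothetical device name above and relying on the status field layout in the kernel docs:

    package main

    import (
        "fmt"
        "log"
        "os/exec"
        "strings"
    )

    // pollHydration reports dm-clone's "<hydrated>/<total>" region counter.
    // Per the kernel docs, the status args after the "clone" target name are:
    //   <metadata block size> <used/total metadata blocks> <region size>
    //   <hydrated/total regions> ...
    func pollHydration(device string) (string, error) {
        out, err := exec.Command("dmsetup", "status", device).Output()
        if err != nil {
            return "", err
        }
        fields := strings.Fields(string(out))
        for i, f := range fields {
            if f == "clone" && i+4 < len(fields) {
                return fields[i+4], nil // e.g. "131072/2621440"
            }
        }
        return "", fmt.Errorf("no clone target in status output: %q", out)
    }

    func main() {
        progress, err := pollHydration("cloned-vol")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("hydrated/total regions:", progress)
        // Once the two numbers match, the source volume is no longer needed
        // and the clone target can be swapped for a plain linear mapping.
    }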

Related

How Google migrated billions of lines of code from Perforce to Piper

Google migrated billions of lines of code from Perforce to Piper over four years to scale and reduce risks. Challenges included dependencies and seamless migration, but the transition was successful, improving operational efficiency.

Booting Linux Off of Google Drive

A programmer's competitiveness leads to booting Linux from Google Drive, facing challenges like networking setup and mounting an Arch Linux root from an S3 bucket. Despite setbacks, Linux boots successfully, integrating Google Drive but facing performance issues and complexities.

Fly.io initiates Region-specific Machines pricing

Fly.io is changing pricing for Machines service to region-specific rates over four months, starting in August and settling in November. Users will see per region charges on invoices, with no immediate changes in July. Concerns raised about price hikes, acknowledged display issues, and ongoing talks about commitment discounts.

We saved $5k a month with a single Grafana query

Checkly's platform team optimized costs by switching to Kubernetes pods from AWS ECS, reducing pod startup times by 300ms. This led to a 25% decrease in pod usage, saving $5.5k monthly.

FileFlows: Execute actions against files in a tree flow structure

FileFlows is a versatile tool for processing various file types like text, images, audio, and video. It supports transcoding, converting, and optimizing files, offering detailed reporting and customization options for users.

13 comments
By @mwcampbell - 4 months
I was never a fan of the typical SAN topology, ever since I read Joyent's responses to one of the big early EBS outages in 2011 or 2012. Plus of course, as the article points out, local storage is faster. But Joyent never actually pulled off anything like what Fly has done for migrating volumes between hosts. Congrats on solving the migration problem while maintaining what's good about local storage.
By @zokier - 4 months
I wonder what the I/O perf will look like during migration. Gut feeling is that going through dm-clone/iscsi/wireguard would be a lot slower than direct local NVMe.
By @mattbee - 4 months
I designed a similar system 10 years ago at Bytemark; it served a few thousand VMs and ran for about 12 years. It was called BigV [1]. It might still be running (any customers here still?). I think the new owners tried to shut it down, but customers kept protesting when offered a less-featureful platform :-)

The two architectural differences from fly:

* the VM clusters were split into "head" and "tail" machines & linked on a dedicated 10Gbps LAN. So each customer VM needed its corresponding head & tail machine to be alive in order to run, but qemu could do all that natively;

* we built our own network storage layer based on NBD called flexnbd [2]. It served local discs to the heads, managed access control and so on. It could also be put into a "mirror" mode where a VM's disc would start writing its blocks out to another server while continuing to serve, keeping track of "dirty" blocks etc. exactly as described here.
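
(For anyone unfamiliar with the pattern: that serve-while-mirroring loop can be sketched roughly as below. This is a toy illustration in Go, not flexnbd's actual C, and every name in it is invented.)

    package main

    import (
        "fmt"
        "sync"
    )

    const blockSize = 4096

    // mirror copies src to dst while writes keep landing on src; each write
    // marks its block dirty, and the copier loops until no blocks are dirty.
    type mirror struct {
        mu    sync.Mutex
        src   [][]byte
        dst   [][]byte
        dirty map[int]bool
    }

    func newMirror(blocks int) *mirror {
        m := &mirror{src: make([][]byte, blocks), dst: make([][]byte, blocks), dirty: map[int]bool{}}
        for i := range m.src {
            m.src[i] = make([]byte, blockSize)
            m.dst[i] = make([]byte, blockSize)
            m.dirty[i] = true // everything starts out un-copied
        }
        return m
    }

    // Write serves a client write during the migration: apply it to src and
    // mark the block dirty so the copier will (re)send it.
    func (m *mirror) Write(block int, data []byte) {
        m.mu.Lock()
        defer m.mu.Unlock()
        copy(m.src[block], data)
        m.dirty[block] = true
    }

    // Drain keeps copying dirty blocks until a pass finds none, at which
    // point writes can be briefly frozen and the VM switched over to dst.
    func (m *mirror) Drain() {
        for {
            m.mu.Lock()
            if len(m.dirty) == 0 {
                m.mu.Unlock()
                return
            }
            var block int
            for b := range m.dirty {
                block = b
                break
            }
            delete(m.dirty, block)
            data := append([]byte(nil), m.src[block]...) // snapshot under the lock
            m.mu.Unlock()
            copy(m.dst[block], data) // stands in for the real network transfer
        }
    }

    func main() {
        m := newMirror(4)
        m.Write(2, []byte("hello"))
        m.Drain()
        fmt.Printf("block 2 mirrored: %q\n", m.dst[2][:5])
    }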

It was very handy to be able to sell and directly attach discs of different performance characteristics without having to migrate machines. But I suspect the network (even at 10Gbps) was too much of a bottleneck.

I can't remember whether Linux supported the kind of fancy disc migration we wanted to do back in 2011. If it did, it was hard enough that spending a year getting our own server right seemed worth it.

It is a particularly sweet trick to have a suspicion about a server and just say "flush it!" and in 12-24 hours, it's no longer in service. We had tools that most of our support team could use to execute on a slight suspicion. You do notice a performance dip while migrations are going on, but the decision to use network storage (and reduce it overall lol) might have masked that.

Having our discs served from userspace reduced the administration we needed to do. But it came with the terror of maintaining a piece of C that shuffled our customers' data around. Also - because I was a masochist - customers' discs were files stored on btrfs, and we became reluctant experts. Overall the system was reliable, but it took a good 12-18 months of customers tolerating fscks (& us being careful not to anger the filesystem).

I did miss this kind of work in 2022 and interviewed for a support role at fly. I'm not sure how to take being rejected at the screener stage, I'm sure some of my former staff might be able to explain it :)

[1] https://blog.bytemark.co.uk/wp-content/uploads/2012/04/Desig...

[2] https://github.com/BytemarkHosting/flexnbd-c

By @fridder - 4 months
My initial thought was that ZFS replication would be excellent for this but I guess it is not low level enough?
By @siliconc0w - 4 months
It'd be nice if Fly offered a highly available disk. I know you can move HA into the database layer but that is a lot of complexity for their target audience. If you can build all this machinery, you can also probably manage running DRBD.
By @schmichael - 4 months
Why iSCSI instead of NVMe-oF?
By @setheron - 4 months
I remember in a past life using DRBD to keep block devices in sync (at RDS AWS).

Is this functionality effectively in-kernel now?

By @nik736 - 4 months
Am I the only one seeing a lot of advantages with local storage? I don't think it's idiosyncratic at all – that's how DigitalOcean became the company it is today, with simple local storage VMs.

The performance of local NVMes is way better, it is more predictable and you don't have to factor in latency, bandwidth, network issues, bottlenecks and more. Redundancy can be achieved with a multi host setup, so even if the host fails the underlying application or database is not impacted.

The one disadvantage I can see is that you can't "scale"/change the underlying disks. The disks you bought when setting up the server are probably there to stay.

By @beedeebeedee - 4 months
I can't speak to the technical aspects of this product, but it makes me think of the movie Ghost in the Machine, and all the horror aspects of AI processes moving between hardware around the world :)
By @dangoodmanUT - 4 months
> When your problem domain is hard, anything you build whose design you can’t fit completely in your head is going to be a fiasco. Shorter form: “if you see Raft consensus in a design, we’ve done something wrong”.

This bothers me a bit. I get what they are saying, but it feels like it assumes they'd be implementing Raft themselves. Packages like Dragonboat make it so you don't have to think about Raft, only whether you are the leader or not.

By @rohitpaulk - 4 months
Great content! Sidenote for the Fly team: on mobile, the “sidenote” cards appear in the wrong order: before the content instead of after it.
By @neom - 4 months
I really love that fly are calling themselves a cloud provider now!!!!!!!!

I've advised a few startups over the years who were trying to take a stab at a "developer focused cloud" and for whatever reason they felt too shy to say it; frankly, I think that's the reason they're no longer around. Fly are bold and I really enjoy how they show the infra side of the engineering.

Handling stateful application migrations with asynchronous data hydration and block-level cloning: A+++. I've been thinking a lot recently about how, if I were ever to build a cloud provider again, I'd focus first on an "intelligent" (read: "AI"-driven) orchestration system. That would be good generally, but especially around things like global data compliance and sovereignty I can imagine some pretty sweet features.

By @nolist_policy - 4 months
QEMU has been able to do shared-nothing live migration for a long time.