When Imperfect Systems Are Good, Actually: Bluesky's Lossy Timelines
Bluesky's "Lossy Timelines" design improves write performance by relaxing consistency: it addresses "hot shard" problems by probabilistically dropping writes, significantly improving latency and scalability for accounts with very high follow counts.
Bluesky has implemented a new system design approach called "Lossy Timelines" to improve the performance of its Following Feed/Timeline. System design is a balancing act between properties such as consistency, availability, and latency, and Bluesky's recent changes trade a small amount of consistency for better write performance. The platform's Timelines database, which serves around 32 million users, suffered from "hot shards" caused by users following very large numbers of accounts, creating performance bottlenecks. By probabilistically dropping writes to a user's Timeline based on their number of follows, Bluesky has capped the workload on its database shards. This cut the P99 latency of Fanout operations for large accounts from several minutes to under ten seconds. Caching strategies were also added to handle high-follow accounts efficiently. Together, these changes improved the scalability and throughput of Bluesky's Timelines while keeping the service within user expectations.
- Bluesky's "Lossy Timelines" improves write performance by accepting reduced consistency.
- The system addresses issues with "hot shards" caused by users following too many accounts.
- The new mechanism probabilistically drops writes to manage database workload.
- P99 latency for Fanout operations has decreased by over 90%, enhancing user experience.
- Caching strategies have been implemented to efficiently handle high-follow accounts.
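The probabilistic write-drop mechanism summarized above can be sketched roughly like this. The threshold value and function name are illustrative assumptions, not Bluesky's actual code:

```python
import random

REASONABLE_LIMIT = 2_000  # illustrative threshold, not Bluesky's actual value

def should_fan_out(follow_count: int) -> bool:
    """Decide whether to write a post into one follower's timeline.
    Followers under the limit always get the write; beyond it, the
    write is kept with probability limit/follows, so the expected
    number of retained writes per timeline stays near the limit."""
    if follow_count <= REASONABLE_LIMIT:
        return True
    return random.random() < REASONABLE_LIMIT / follow_count
```

A user following twice the limit would miss roughly half of incoming posts, which is the consistency trade-off the article describes.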
Related
How Bluesky, Alternative to X and Facebook, Is Handling Growth
Bluesky, a decentralized social media platform launched in February 2023, has surpassed 15 million users amid rapid growth, facing challenges like outages while promoting user control and developer engagement.
Bluesky, eXit, and vague thoughts about self-hosting
The migration from Twitter/X to Bluesky has boosted user engagement, with the author's followers increasing significantly. They are exploring self-hosting options and seeking simpler integration methods while continuing manual posting.
How decentralized is Bluesky really?
Bluesky is gaining popularity as an alternative to X-Twitter, but it faces concerns over centralization and increasing resource requirements, despite positive leadership and user-friendly features.
The Rise of Bluesky
Bluesky is gaining popularity as a user-friendly alternative to Twitter, offering chronological feeds and features like "Starter Packs," attracting users, especially in the scientific community, though sustainability remains uncertain.
How Bluesky Works
Bluesky is a decentralized social network using a federated architecture and the Authenticated Transfer Protocol for data sharing, featuring user-controlled moderation, customizable feeds, and efficient querying with advanced data structures.
- Many commenters express curiosity about the balance between performance and user experience, particularly regarding the impact of lossy timelines on content visibility.
- Several users suggest alternative strategies for managing timelines, such as hybrid approaches or dynamic fan-out methods to improve efficiency.
- There is a recognition of the technical challenges involved in scaling social media platforms, with some users drawing comparisons to other systems like Twitter and Nostr.
- Concerns are raised about the potential negative user experience due to dropped posts, especially for users following a large number of accounts.
- Commenters appreciate the technical insights shared in the article, highlighting the importance of quality information in discussions about system design.
When you have a celebrity account, instead of fanning out every message to millions of followers' timelines, it would be cheaper to do nothing when the celebrity posts, and later, when serving each follower's timeline, fetch the celebrity's posts and merge them into the timeline. When millions of followers do that, it becomes a cheap read-only fetch from a hot cache.
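A minimal sketch of that read-time merge, assuming posts are (timestamp, post_id) tuples and all names are hypothetical:

```python
import heapq
import itertools

def build_timeline(own_timeline, celebrity_feeds, limit=50):
    """Merge a follower's precomputed timeline with the cached feeds
    of celebrities they follow, newest-first. Every input must already
    be sorted newest-first; in the scheme described above, the
    celebrity feeds would come from a shared hot cache rather than
    per-follower fan-out writes."""
    merged = heapq.merge(own_timeline, *celebrity_feeds,
                         key=lambda post: post[0], reverse=True)
    return list(itertools.islice(merged, limit))
```

The merge is lazy, so only the first `limit` posts are ever pulled from the cached feeds.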
In the Blekko search engine back end we built an index that was 'eventually consistent', which allowed updates to propagate to the user-facing index more quickly, at the cost that two users running the exact same query would get slightly different results. If they kept repeating those same queries they would eventually get the exact same results.
Systems like this bring in a lot of control systems theory because they have the potential to oscillate if there is positive feedback (in search engines that positive feedback comes from the ranker, which looks at which link you clicked and gives it a higher weight), and it is important that they not go crazy. Some of the most interesting, and most subtle, algorithm work was done keeping that system "critically damped" so that it would converge quickly.
Reading this description of how users' timelines are sharded, with the same sorts of feedback loops (in this case 'likes' or 'reposts'), this sounds like a pretty interesting problem space to explore.
The Lossy Timelines solution basically means you skip updating the feed for some people who are above the reasonable-follows threshold. I get that.
Seeing them get 96% improvements is insane. Does that mean they have a ton of users following an unreasonable number of people, or do they just set a very low threshold for "reasonable" follows? I doubt it's the latter, since that would mean a lot of people missing updates.
How is it possible to get such massive improvements when you're only skipping a presumably small % of people per new post?
EDIT: never mind, I thought about it again. The issue is that a single user with millions of follows will constantly have posts written to their timeline, which slows down the fanout service when a celebrity posts, since you're going through many db pages.
Let's imagine something like this: instead of writing to every user's timeline, the post is written once for each shard containing at least one follower. This caps the fan-out at write time to hundreds of shards. At read time, serving a given user's timeline reads that hot slice and filters for accounts they actually follow. It definitely adds more load, but
- the read is still colocated inside the shard, so latency remains low
- for mega-followers the page will not see older entries anyway
There are of course other considerations, but I'm curious about what the load for something like that would look like (and I don't have the data nor infrastructure to test it)
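A toy in-memory version of the commenter's write-once-per-shard idea might look like the following. The shard count and in-memory `shard_feeds` store are illustrative stand-ins for a real sharded database:

```python
from collections import defaultdict

NUM_SHARDS = 256                 # illustrative shard count
shard_feeds = defaultdict(list)  # shard id -> posts written to that shard

def fan_out_per_shard(post, author, follower_ids):
    """Write the post once per shard containing at least one follower,
    capping write amplification at NUM_SHARDS per post."""
    for shard in {fid % NUM_SHARDS for fid in follower_ids}:
        shard_feeds[shard].append((author, post))

def read_timeline(user_id, follows):
    """Read the user's shard-local slice and keep only posts from
    accounts they actually follow; the read stays inside one shard."""
    return [post for author, post in shard_feeds[user_id % NUM_SHARDS]
            if author in follows]
```

The filtering cost shifts to read time, which is the extra load the comment acknowledges.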
It's insanely frustrating.
Hopefully you're adjusting the lossiness weighting and cut-off by whether a user is active at any particular time? Because otherwise, if the cap is set too low, applying this rule is a very bad UX in my experience x_x
While I'm fine with the solution, the wording of this sentence led me to believe that the solution was going to be imperfect chronology, not dropped posts in your feed.
However, I do love reading about the technical challenge. I think Twitter has a special architecture for celebrities with millions of followers. Given Bluesky is a quasi-clone, I wonder why they did not follow in these footsteps.
The current solution is for everyone to use the same few relays, which is basically a polite nod to Bluesky's architecture. The long-term solution is—well, it involves a lot of relay hint dropping and a reliance on Japanese levels of acuity when it comes to picking up on hints (among clients). But (a) it's proving extremely slow going and (b) it only aims to mitigate the "global as relates to me" problem.
https://aws.amazon.com/builders-library/workload-isolation-u...
The basic idea is to assign each user to multiple shards, decreasing the chances of another user sharing all their shards with the badly behaving user.
Fixing this issue as described in the article makes sense, but if they had done shuffle sharding in the first place, it would cover new issues like this without affecting many other users.
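Shuffle sharding, as described in the linked AWS article, can be sketched as deterministically hashing each user to a small pseudo-random subset of shards. Shard counts and names here are illustrative, not Bluesky's configuration:

```python
import hashlib
import random

NUM_SHARDS = 64      # illustrative total shard count
SHARDS_PER_USER = 4  # each user writes to a small fixed subset

def user_shards(user_id: str) -> list[int]:
    """Derive a stable pseudo-random subset of shards for a user.
    Two users rarely share *all* their shards, so one misbehaving
    user degrades only a fraction of any other user's capacity."""
    rng = random.Random(hashlib.sha256(user_id.encode()).digest())
    return rng.sample(range(NUM_SHARDS), SHARDS_PER_USER)
```

With 64 shards and 4 per user there are over 600,000 possible shard subsets, so the chance of two users colliding on all four is tiny.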
Lossy indeed.
"In the case of timelines, each “page” of followers is 10,000 users large and each “page” must be fanned out before we fetch the next page. This means that our slowest writes will hold up the fetching and Fanout of the next page."
Basically that means they block on each page, process all the items on the page, and then move on to the next page. Why wouldn't you decouple the page fetcher from the processing of the pages?
A page-fetching component should be able to keep fetching successive pages of followers continuously, without waiting for every item in the current page to be updated before it continues.
Something that comes to mind would be a fetcher component that fetches pages, stores each page in S3, and publishes the metadata (content) and the S3 location to a queue (SQS) consumed by timeline publishers that scale independently based on load. You can control the concurrency in this system much better, and you could also partition with another system like Kafka, using the shards as keys in the queue, to "slow down" the work without having to drop posts from timelines (timelines are eventually consistent regardless).
I feel like I'm missing something and there's a valid reason to do it this way.
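The decoupling the comment proposes can be illustrated in-process with a bounded queue between one fetcher and independently scalable workers. Here `queue.Queue` stands in for SQS and the page payload for the S3 object; all names are illustrative:

```python
import queue
import threading

page_queue = queue.Queue(maxsize=8)  # bounded: backpressure on the fetcher

def fetcher(pages):
    """Fetch follower pages continuously; never waits for processing,
    only for a free queue slot if workers fall far behind."""
    for page in pages:
        page_queue.put(page)
    page_queue.put(None)  # sentinel: no more pages

def worker(results):
    """Consume pages independently of fetching; scale out by running
    more worker threads against the same queue."""
    while (page := page_queue.get()) is not None:
        results.extend(f"wrote {uid}" for uid in page)
    page_queue.put(None)  # propagate sentinel to any other workers
```

A real SQS-backed system would additionally need retries and visibility timeouts, which this in-process sketch omits.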
I must admit, I had some trouble following the author's transition from "celebrity" with many followers to "bot" with many follows. While I assume the work done for a celebrity to scatter a bunch of posts would be symmetric to the work done for a commensurate bot to gather a bunch of posts, I had the impression that the author was introducing an entirely different concept in "Lossy Timelines."
The "reasonable limit" is likely set to account for such an effect, but I wonder whether a per-user limit based on the activity of the accounts one follows would improve on this approach.
Sounds like Bluesky Pro.
Indeed:
> This means each user gets their own Timeline partition, randomly distributed among shards of our horizontally scalable database (ScyllaDB), replicated across multiple shards for high availability
Long ago, I worked for a dating site. Our CTO at the time was a "guest of honor" who was brought in by a family friend who was working in the marketing at the time. The CTO was a university professor who took on a job as a courtesy (he didn't need the money nor fame, he had enough of both, and actually liked teaching).
But he instituted a lot of experimental practices in the company. Such as switching roles every now and then (anyone in the company, except administration, could apply for a different role and try wearing a different hat), or holding company-wide discussions of problems where employees had to prepare a presentation on their current work (very unusual at the time, though the practice later became common in larger companies).
Once he announced a contest for a problem he was trying to solve. Since we were building a dating site, the obvious problem was matching. The trouble was that the more properties there were to match on, the longer matching took (among other problems). So the system punished site users who took the time to fill out the questionnaires thoroughly and favored the "slackers".
I didn't have any bright ideas on how to optimize the matching / search for matches. So, ironically, I asked: "what if we just randomly threw away properties beyond a certain threshold?" I was surprised that my idea received any traction at all. The answer was along the lines of "that would definitely work, but I wouldn't know how to explain this behavior to the users". Which, at the time, I took to be yet another eccentricity of the old man... but hey, the idea stuck with me for a long time!
Seriously? Isn't this the nut of your problem right here?
A social media system doesn't need to be perfect at all. It was clear to me from the beginning that Bluesky's feeds aren't very fast (not that they are crazy slow), but if it saves money or effort, it's no problem if notifications are delayed 30s.