When Imperfect Systems Are Good, Actually: Bluesky's Lossy Timelines
Bluesky's "Lossy Timelines" design improves write performance by relaxing consistency: it addresses "hot shard" problems by probabilistically dropping writes, significantly improving latency and scalability for accounts with very high follow counts.
Bluesky has implemented a new system design approach called "Lossy Timelines" to improve the performance of its Following Feed/Timeline. System design is a balancing act between properties such as consistency, availability, and latency, and Bluesky's recent changes trade a small amount of consistency for better write performance. The platform's Timelines database, which serves around 32 million users, suffered from "hot shards" caused by users following very large numbers of accounts, creating performance bottlenecks. By probabilistically dropping writes to a user's Timeline based on their number of follows, Bluesky has capped the workload on its database shards. This cut the P99 latency of Fanout operations for large accounts from several minutes to under ten seconds. Caching strategies were also added to handle high-follow accounts efficiently. Together, these changes improved the scalability and throughput of Bluesky's Timelines while keeping the service within user expectations.
- Bluesky's "Lossy Timelines" improves write performance by accepting reduced consistency.
- The system addresses issues with "hot shards" caused by users following too many accounts.
- The new mechanism probabilistically drops writes to manage database workload.
- P99 latency for Fanout operations has decreased by over 90%, enhancing user experience.
- Caching strategies have been implemented to efficiently handle high-follow accounts.
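The probabilistic write-drop mechanism summarized above can be sketched roughly like this. The threshold value and function name are illustrative assumptions, not Bluesky's actual code:

```python
import random

REASONABLE_LIMIT = 2_000  # illustrative threshold, not Bluesky's actual value

def should_fan_out(follow_count: int) -> bool:
    """Decide whether to write a post into one follower's timeline.
    Followers under the limit always get the write; beyond it, the
    write is kept with probability limit/follows, so the expected
    number of retained writes per timeline stays near the limit."""
    if follow_count <= REASONABLE_LIMIT:
        return True
    return random.random() < REASONABLE_LIMIT / follow_count
```

A user following twice the limit would miss roughly half of incoming posts, which is the consistency trade-off the article describes.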
Related
How Bluesky, Alternative to X and Facebook, Is Handling Growth
Bluesky, a decentralized social media platform launched in February 2023, has surpassed 15 million users amid rapid growth, facing challenges like outages while promoting user control and developer engagement.
Bluesky, eXit, and vague thoughts about self-hosting
The migration from Twitter/X to Bluesky has boosted user engagement, with the author's followers increasing significantly. They are exploring self-hosting options and seeking simpler integration methods while continuing manual posting.
How decentralized is Bluesky really?
Bluesky is gaining popularity as an alternative to X-Twitter, but it faces concerns over centralization and increasing resource requirements, despite positive leadership and user-friendly features.
The Rise of Bluesky
Bluesky is gaining popularity as a user-friendly alternative to Twitter, offering chronological feeds and features like "Starter Packs," attracting users, especially in the scientific community, though sustainability remains uncertain.
How Bluesky Works
Bluesky is a decentralized social network using a federated architecture and the Authenticated Transfer Protocol for data sharing, featuring user-controlled moderation, customizable feeds, and efficient querying with advanced data structures.
- Many commenters express curiosity about the balance between performance and user experience, particularly regarding the impact of lossy timelines on content visibility.
- Several users suggest alternative strategies for managing timelines, such as hybrid approaches or dynamic fan-out methods to improve efficiency.
- There is a recognition of the technical challenges involved in scaling social media platforms, with some users drawing comparisons to other systems like Twitter and Nostr.
- Concerns are raised about the potential negative user experience due to dropped posts, especially for users following a large number of accounts.
- Commenters appreciate the technical insights shared in the article, highlighting the importance of quality information in discussions about system design.
When you have a celebrity account, instead of fanning out every message to millions of followers' timelines, it would be cheaper to do nothing when the celebrity posts, and later, when serving each follower's timeline, fetch the celebrity's posts and merge them into the timeline. When millions of followers do that, it becomes a cheap read-only fetch from a hot cache.
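A minimal sketch of that read-time merge, assuming posts are (timestamp, post_id) tuples and all names are hypothetical:

```python
import heapq
import itertools

def build_timeline(own_timeline, celebrity_feeds, limit=50):
    """Merge a follower's precomputed timeline with the cached feeds
    of celebrities they follow, newest-first. Every input must already
    be sorted newest-first; in the scheme described above, the
    celebrity feeds would come from a shared hot cache rather than
    per-follower fan-out writes."""
    merged = heapq.merge(own_timeline, *celebrity_feeds,
                         key=lambda post: post[0], reverse=True)
    return list(itertools.islice(merged, limit))
```

The merge is lazy, so only the first `limit` posts are ever pulled from the cached feeds.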
In the Blekko search engine back end we built an index that was 'eventually consistent', which allowed updates to propagate to the user-facing index more quickly, at the cost that two users running the exact same query would get slightly different results. If they kept repeating those same queries they would eventually get the exact same results.
Systems like this bring in a lot of control systems theory because they have the potential to oscillate if there is positive feedback (in search engines that positive feedback comes from the ranker, which looks at which link you clicked and gives it a higher weight), and it is important that they not go crazy. Some of the most interesting, and most subtle, algorithm work was done keeping that system "critically damped" so that it would converge quickly.
Reading this description of how users' timelines are sharded, with the same sorts of feedback loops (in this case 'likes' or 'reposts'), this sounds like a pretty interesting problem space to explore.
The Lossy Timelines solution basically means you skip updating the feed for some people who are above the reasonable-follows threshold. I get that.
Seeing them get 96% improvements is insane. Does that mean they have a ton of users following an unreasonable number of people, or do they just set a very low threshold for "reasonable" follows? I doubt it's the latter, since that would mean a lot of people missing updates.
How is it possible to get such massive improvements when you're only skipping a presumably small % of people per new post?
EDIT: never mind, I thought about it again. The issue is that a single user with millions of follows will constantly have posts written to their timeline, which slows down the fanout service when a celebrity posts, since you're going through many db pages.
Let's imagine something like this: instead of writing to every user's timeline, the post is written once for each shard containing at least one follower. This caps the fan-out at write time to hundreds of shards. At read time, serving a given user's timeline reads that hot slice and filters for accounts they actually follow. It definitely adds more load, but
- the read is still colocated inside the shard, so latency remains low
- for mega-followers the page will not see older entries anyway
There are of course other considerations, but I'm curious about what the load for something like that would look like (and I don't have the data nor infrastructure to test it)
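A toy in-memory version of the commenter's write-once-per-shard idea might look like the following. The shard count and in-memory `shard_feeds` store are illustrative stand-ins for a real sharded database:

```python
from collections import defaultdict

NUM_SHARDS = 256                 # illustrative shard count
shard_feeds = defaultdict(list)  # shard id -> posts written to that shard

def fan_out_per_shard(post, author, follower_ids):
    """Write the post once per shard containing at least one follower,
    capping write amplification at NUM_SHARDS per post."""
    for shard in {fid % NUM_SHARDS for fid in follower_ids}:
        shard_feeds[shard].append((author, post))

def read_timeline(user_id, follows):
    """Read the user's shard-local slice and keep only posts from
    accounts they actually follow; the read stays inside one shard."""
    return [post for author, post in shard_feeds[user_id % NUM_SHARDS]
            if author in follows]
```

The filtering cost shifts to read time, which is the extra load the comment acknowledges.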
It's insanely frustrating.
Hopefully you're adjusting the lossiness weighting and cut-off by whether a user is active at any particular time? Because otherwise, if the cap is set too low, applying this rule is a very bad UX in my experience x_x
While I'm fine with the solution, the wording of this sentence led me to believe that the solution was going to be imperfect chronology, not dropped posts in your feed.
However, I do love reading about the technical challenge. I think Twitter has a special architecture for celebrities with millions of followers. Given Bluesky is a quasi-clone, I wonder why they did not follow in these footsteps.
The current solution is for everyone to use the same few relays, which is basically a polite nod to Bluesky's architecture. The long-term solution is—well, it involves a lot of relay hint dropping and a reliance on Japanese levels of acuity when it comes to picking up on hints (among clients). But (a) it's proving extremely slow going and (b) it only aims to mitigate the "global as relates to me" problem.
https://aws.amazon.com/builders-library/workload-isolation-u...
The basic idea is to assign each user to multiple shards, decreasing the chances of another user sharing all their shards with the badly behaving user.
Fixing this issue as described in the article makes sense, but if they had done shuffle sharding in the first place, it would cover new issues like this without affecting many other users.
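Shuffle sharding, as described in the linked AWS article, can be sketched as deterministically hashing each user to a small pseudo-random subset of shards. Shard counts and names here are illustrative, not Bluesky's configuration:

```python
import hashlib
import random

NUM_SHARDS = 64      # illustrative total shard count
SHARDS_PER_USER = 4  # each user writes to a small fixed subset

def user_shards(user_id: str) -> list[int]:
    """Derive a stable pseudo-random subset of shards for a user.
    Two users rarely share *all* their shards, so one misbehaving
    user degrades only a fraction of any other user's capacity."""
    rng = random.Random(hashlib.sha256(user_id.encode()).digest())
    return rng.sample(range(NUM_SHARDS), SHARDS_PER_USER)
```

With 64 shards and 4 per user there are over 600,000 possible shard subsets, so the chance of two users colliding on all four is tiny.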
Lossy indeed.
"In the case of timelines, each “page” of followers is 10,000 users large and each “page” must be fanned out before we fetch the next page. This means that our slowest writes will hold up the fetching and Fanout of the next page."
Basically that means they block on each page, process all the items on the page, and then move on to the next page. Why wouldn't you decouple the page fetcher from the processing of the pages?
A page-fetching component should be able to keep fetching successive pages of followers continuously, without waiting for every item in the current page to be updated before it continues.
Something that comes to mind would be a fetcher component that fetches pages, stores each page in S3, and publishes the metadata (content) and the S3 location to a queue (SQS) consumed by timeline publishers that scale independently based on load. You can control the concurrency in this system much better, and you could also partition with another system like Kafka, using the shards as keys in the queue, to "slow down" the work without having to drop posts from timelines (timelines are eventually consistent regardless).
I feel like I'm missing something and there's a valid reason to do it this way.
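The decoupling the comment proposes can be illustrated in-process with a bounded queue between one fetcher and independently scalable workers. Here `queue.Queue` stands in for SQS and the page payload for the S3 object; all names are illustrative:

```python
import queue
import threading

page_queue = queue.Queue(maxsize=8)  # bounded: backpressure on the fetcher

def fetcher(pages):
    """Fetch follower pages continuously; never waits for processing,
    only for a free queue slot if workers fall far behind."""
    for page in pages:
        page_queue.put(page)
    page_queue.put(None)  # sentinel: no more pages

def worker(results):
    """Consume pages independently of fetching; scale out by running
    more worker threads against the same queue."""
    while (page := page_queue.get()) is not None:
        results.extend(f"wrote {uid}" for uid in page)
    page_queue.put(None)  # propagate sentinel to any other workers
```

A real SQS-backed system would additionally need retries and visibility timeouts, which this in-process sketch omits.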
I must admit, I had some trouble following the author's transition from "celebrity" with many followers to "bot" with many follows. While I assume the work done for a celebrity to scatter a bunch of posts would be symmetric to the work done for a commensurate bot to gather a bunch of posts, I had the impression that the author was introducing an entirely different concept in "Lossy Timelines."
The "reasonable limit" is likely set to account for such an effect, but I wonder whether a per-user limit based on the activity of the accounts one follows would improve on this approach.
Sounds like Bluesky Pro.
Indeed:
> This means each user gets their own Timeline partition, randomly distributed among shards of our horizontally scalable database (ScyllaDB), replicated across multiple shards for high availability
Long ago, I worked for a dating site. Our CTO at the time was a "guest of honor" who was brought in by a family friend who was working in the marketing at the time. The CTO was a university professor who took on a job as a courtesy (he didn't need the money nor fame, he had enough of both, and actually liked teaching).
But he instituted a lot of experimental practices in the company. Such as switching roles every now and then (anyone in the company, except administration, could apply for a different role and try wearing a different hat), or holding company-wide discussions of problems where employees had to prepare a presentation on their current work (very unusual at the time, though the practice later became common in larger companies).
Once he announced a contest for a problem he was trying to solve. Since we were building a dating site, the obvious problem was matching. The trouble was that the more properties there were to match on, the longer matching took (among other problems). So the system punished site users who took the time to fill out the questionnaires thoroughly and favored the "slackers".
I didn't have any bright ideas on how to optimize the matching / search for matches. So, ironically, I asked: "what if we just randomly threw away properties beyond a certain threshold?" I was surprised that my idea received any traction at all. The answer was along the lines of "that would definitely work, but I wouldn't know how to explain this behavior to the users". Which, at the time, I took to be yet another eccentricity of the old man... but hey, the idea stuck with me for a long time!
Seriously? Isn't this the nut of your problem right here?
A social media system doesn't need to be perfect at all. It was clear to me from the beginning that Bluesky's feeds aren't very fast (not that they are crazy slow), but if it saves money or effort, it's no problem if notifications are delayed 30s.