Optimizing the Lichess Tablebase Server
The blog post details optimizations for a Syzygy tablebase server on lichess.org, addressing RAID challenges with dm-integrity on LVM. Improved hardware, RAID 5, pread usage, SSD caching, and parallel reads enhance performance.
The blog post discusses the optimization of a 7-piece Syzygy tablebase server on lichess.org. The server faced challenges during RAID integrity checks while handling tablebase requests. To address this, a new approach using dm-integrity on LVM was implemented to passively check blocks when read. A new server setup with improved hardware was used to migrate the tablebases without significant downtime. The post details hardware specifications and the use of RAID 5 for disk failure recovery and load distribution. It also delves into performance benchmarks, focusing on response times and tail latencies for optimizations. Different methods of opening and reading table files are compared, highlighting the benefits of using pread over memory mapping for server implementations. The post also discusses the use of SSD space for caching and parallelizing reads to reduce tail latencies. Overall, the optimizations aim to enhance the server's performance and efficiency in handling tablebase requests on lichess.org.
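As a rough illustration of the pread and parallel-read ideas in the summary, here is a minimal Python sketch. It is not the server's code: the helper names `read_block` and `read_blocks_parallel`, the thread pool, and the worker count are assumptions made for this example. The only system facility relied on is `os.pread`, which reads at an explicit offset and leaves the shared file position untouched, so several threads can share one file descriptor.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_block(fd: int, offset: int, length: int) -> bytes:
    """Read up to `length` bytes starting at `offset` using positional reads."""
    chunks = []
    while length > 0:
        chunk = os.pread(fd, length, offset)  # does not move the file position
        if not chunk:  # reached end of file
            break
        chunks.append(chunk)
        offset += len(chunk)
        length -= len(chunk)
    return b"".join(chunks)


def read_blocks_parallel(fd: int, requests, workers: int = 4):
    """Issue several (offset, length) reads concurrently.

    Results are returned in the same order as `requests`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: read_block(fd, r[0], r[1]), requests))
```

A caller would open a table file once with `os.open(path, os.O_RDONLY)` and pass the descriptor to both helpers. Unlike a memory-mapped file, where a slow disk access shows up as a page fault in the middle of ordinary code, each read here is an explicit call that can be handed to a worker thread.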
Related
The FreeBSD-native-ish home lab and network
The author details a complex home lab setup with a FreeBSD server on a laptop, utilizing Jails for services like WordPress and emphasizing security measures and network configurations for efficiency and functionality.
Enabling NVMe Support on Supermicro C7Z97-MF Motherboard
The author details upgrading a Supermicro C7Z97-MF motherboard for NVMe drive support, highlighting speed benefits. The process involves BIOS backup, modified BIOS installation, and performance optimization. NVMe drive enhances system without motherboard replacement.
Our great database migration
Shepherd, an insurance pricing company, migrated from SQLite to Postgres to boost performance and scalability for their pricing engine, "Alchemist." The process involved code changes, adopting Neon database, and optimizing performance post-migration.
Open-LLM performances are plateauing
The blog addresses Open-LLM's stagnant performance, proposing ways to boost competitiveness. It aims to reinvigorate the community by making the leaderboard more challenging, fostering innovation and improvement.
Troubleshooting my offline Zpool
The author resolved issues with their 8 x 6TB IronWolf RAID-Z2 ZFS array by jiggling SATA power leads, ensuring all drives reappeared. They plan to order a new SATA power cable for a long-term fix, emphasizing systematic troubleshooting.
> The added constant seems artificial, but it's just viewing the results from the point of view of a client with 30ms ping time. Otherwise the log scaled x-axis would overemphasize the importance of a few milliseconds at the low end.
I thought this was interesting - maybe it's a standard practice I was just unaware of but it seems like a smart trick.
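A small sketch of why the constant helps, assuming the 30ms ping from the quote. The function names are made up for this illustration and the log base is arbitrary:

```python
import math

PING_MS = 30.0  # client round-trip time taken from the quoted passage


def log_axis_position(server_ms: float, ping_ms: float = PING_MS) -> float:
    """Position of a response time on a log-scaled axis after adding the ping."""
    return math.log10(server_ms + ping_ms)


def axis_gap(a_ms: float, b_ms: float, ping_ms: float = PING_MS) -> float:
    """Distance on the log axis between two server response times."""
    return abs(log_axis_position(b_ms, ping_ms) - log_axis_position(a_ms, ping_ms))
```

With `ping_ms=0`, the gap between 1ms and 5ms is as wide on the axis as the gap between 100ms and 500ms. With the 30ms offset, the 1ms to 5ms gap shrinks to a small fraction of that, which matches what a client on such a connection would actually notice.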
I guess they were interested in doing the testing and optimization for fun. From a product standpoint I probably would have invested my limited time in other projects.
The reason for the optimization is that there is so much IO activity the RAID checks can't complete.
It is unclear from the article whether the RAID checks were ever completed on the 17TiB of data. Instead, they chose to disable the periodic RAID checks and switch to doing the error checking as each page of data is read in. The two are not equivalent, and both should be used for important data.
Finding corrupt data only as you try to read it can lead to long-running data corruption, maybe to the point that your backups do not go back far enough to restore the uncorrupted data. Underpinning this also is a change to RAID 0... While it is the fastest option, they are putting a lot of faith in that NVMe config handling that kind of workload.
Hope they have good backups...
EDIT: A good way to solve this is to spin up a temporary server, restore your backups to it, and do the full data checks there. When they succeed, you have also checked your backup and restore process along with the integrity of the files. You still want enough overhead available to complete the RAID checks on the primary server, and you shouldn't use RAID 0 for performance.
Shogi is the most entertaining of the chess variants. Xiangqi not as much.
Lichess's mobile app was a weak spot; however, the v2 rewrite in Flutter is already pretty good while still in beta.
And keep in mind Thibault pays himself less than 60k/year.