July 12th, 2024

Optimizing the Lichess Tablebase Server

The blog post details optimizations for a Syzygy tablebase server on lichess.org, addressing problems with RAID integrity checks by using dm-integrity on LVM. Improved hardware, RAID 5, pread-based reads, SSD caching, and parallel reads improve performance.

The blog post describes how the 7-piece Syzygy tablebase server on lichess.org was optimized. The server struggled to run RAID integrity checks while also serving tablebase requests. To address this, the new setup uses dm-integrity on LVM, which checks blocks passively whenever they are read. The tablebases were migrated to a new server with improved hardware without significant downtime. The post details the hardware specifications and the use of RAID 5 for disk failure recovery and load distribution.

The post then turns to performance benchmarks, focusing on response times and tail latencies. It compares different methods of opening and reading table files and highlights the benefits of pread over memory mapping for server implementations. It also discusses using SSD space for caching and parallelizing reads to reduce tail latencies.
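
The positioned-read and parallel-read ideas mentioned in the summary can be illustrated with a minimal Python sketch. This is not the server's actual code: the helper names `read_block` and `read_blocks_parallel`, the thread-pool size, and the temporary demo file are assumptions made for illustration. `os.pread` (Unix only) reads at an explicit offset without touching a shared file position, so several threads can issue reads on one descriptor at the same time, and an I/O failure surfaces as an ordinary `OSError` instead of a fault on a mapped page.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_block(fd: int, offset: int, length: int) -> bytes:
    """Read up to `length` bytes at `offset` without moving the file position."""
    chunks = []
    while length > 0:
        chunk = os.pread(fd, length, offset)  # may return fewer bytes than asked
        if not chunk:  # end of file reached
            break
        chunks.append(chunk)
        offset += len(chunk)
        length -= len(chunk)
    return b"".join(chunks)


def read_blocks_parallel(fd: int, requests, workers: int = 4):
    """Issue several (offset, length) reads concurrently on one descriptor.

    Hypothetical helper: results come back in the order of `requests`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: read_block(fd, r[0], r[1]), requests))


def demo():
    """Write a 4096-byte file and read three blocks from it in parallel."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(bytes(range(256)) * 16)
        f.flush()
        fd = os.open(f.name, os.O_RDONLY)
        try:
            # The last request runs past the end of the file and is truncated.
            return read_blocks_parallel(fd, [(0, 4), (256, 4), (4094, 8)])
        finally:
            os.close(fd)
```

Because each request carries its own offset, the reads do not need a lock around a shared seek position, which is what makes it straightforward to fan them out over a pool of workers.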

9 comments
By @imperialdrive - 7 months
Lichess is one of those things you just have to sit and appreciate like a fine wine. It's absolutely wonderful for people in the chess community. I use it every day and am inspired by the functionality and performance, especially knowing it's a 1-2 person shop with a limited budget.

By @robbles - 7 months
> here are the empirical distribution functions (ECDFs) with 30ms added to each response time

> The added constant seems artificial, but it's just viewing the results from the point of view of a client with 30ms ping time. Otherwise the log scaled x-axis would overemphasize the importance of a few milliseconds at the low end.

I thought this was interesting - maybe it's a standard practice I was just unaware of but it seems like a smart trick.
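
The trick quoted above can be sketched in a few lines of Python. The function names, the sample response times, and the default of 30 ms are illustrative assumptions, not data or code from the article. The sketch builds an empirical CDF after adding a constant ping time to every sample, and measures how wide a gap appears on a log-scaled axis with and without the offset.

```python
import math


def ecdf(samples):
    """Return (value, cumulative fraction) pairs for the sorted samples."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def shifted_ecdf(response_times_ms, ping_ms=30.0):
    """ECDF as seen by a client whose round trip adds `ping_ms` to each response."""
    return ecdf([t + ping_ms for t in response_times_ms])


def log_width(a, b):
    """Distance between a and b on a log10 axis, in decades."""
    return math.log10(b / a)


# Hypothetical server-side response times in milliseconds.
times = [2.0, 1.0, 4.0, 100.0]
points = shifted_ecdf(times)

# On a log axis, the step from 1 ms to 2 ms spans about 0.30 decades,
# while the same step seen through a 30 ms ping (31 ms to 32 ms) spans
# about 0.014 decades, so it no longer dominates the plot.
raw_gap = log_width(1.0, 2.0)
shifted_gap = log_width(31.0, 32.0)
```

The offset leaves the ordering of the samples and the cumulative fractions unchanged. It only compresses the low end of the log-scaled x-axis.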

By @aeyes - 7 months
Did they have to reduce cost or is there any other reason to not stick 20TB of SSDs in a box and call it a day? 4TB SSDs only cost ~$300, even HP or Dell SFF drives aren't much more expensive.

I guess they were interested in doing the testing and optimization for fun. From a product standpoint I probably would have invested my limited time in other projects.

By @treebeard901 - 7 months
Some questionable choices are made in this optimization.

The reason for the optimization is that there is so much IO activity the RAID checks can't complete.

It is unclear from the article if the RAID checks were ever completed on 17TiB of data. Instead, they chose to disable the periodic RAID checks and switch to checking for errors as each page of data is read in. The two are not equivalent, and both should be used for important data.

Finding corrupt data only as you try to read it can lead to long-running data corruption, maybe to the point that your backups do not go back far enough to restore the uncorrupted data. Underpinning this also is a change to RAID 0... While it is the fastest option, they are putting a lot of faith in that NVMe config handling that kind of workload.

Hope they have good backups...

EDIT: A good way to solve this is to spin up a temporary server, restore your backups to it, and do the full data checks. When they succeed, you have also checked your backup and restore process along with the integrity of the file. You still want to have enough overhead available to complete the RAID checks on the primary server, and you should not use RAID 0 for performance.

By @29athrowaway - 7 months
There is also lishogi, but it is small enough not to require such optimizations yet.

Shogi is the most entertaining of the chess variants. Xiangqi not as much.

By @everyone - 7 months
A lichess is a female lich I'm assuming? (It's like baron / baroness)

By @hocuspocus - 7 months
I know it's not a fair comparison, but I'm truly impressed by the quality of engineering shown by the Lichess team, when their main competitor was, for example, boasting about a migration to GCP and yet suffering from repeated outages due to fairly organic growth in popularity, even though I believe they employ 100x more people.

Lichess' mobile app was a weak spot; however, the v2 rewrite in Flutter is already pretty good while still in beta.

And keep in mind Thibault pays himself less than 60k/year.