September 9th, 2024

300μs typo correction for 1.3M words

Trieve has enhanced its typo correction system, achieving 300-microsecond responses for correctly spelled queries and about 5 milliseconds per word for misspellings, using ClickHouse, a BKTree, and Redis for improved search efficiency.

300μs typo correction for 1.3M words

Trieve has significantly improved the typo correction system for its Hacker News search engine, reducing response time from over 30 milliseconds to just 300 microseconds for correctly spelled queries and approximately 5 milliseconds per word for misspellings. The initial version was deemed too slow for practical use, prompting a complete overhaul.

The new system combines a dictionary built from 38 million posts, stored in ClickHouse for efficient access, with a Burkhard-Keller tree (BKTree) data structure for rapid typo correction. The BKTree is constructed by processing chunks of text and is stored in Redis for quick retrieval, with a cache layer that minimizes the time taken to access it during searches.

A preliminary English word identification step ensures that legitimate words are not incorrectly flagged as misspellings, and affix analysis handles words with prefixes and suffixes. Future enhancements may include query splitting and concatenation features that leverage the same dictionary lookup capabilities. Overall, the improvements have produced a more responsive and accurate search experience for users.
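The article does not show Trieve's actual BKTree code, but the data structure it describes can be sketched in a few dozen lines of std-only Rust. A BK-tree keys each child by its edit distance to the parent's word, so a lookup with tolerance `t` can prune any subtree whose key falls outside `[d - t, d + t]` by the triangle inequality. The word list below is a made-up stand-in for the real dictionary.

```rust
use std::collections::{hash_map::Entry, HashMap};

/// Levenshtein edit distance between two words (classic DP, O(m*n)).
fn levenshtein(a: &str, b: &str) -> u32 {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<u32> = (0..=b.len() as u32).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i as u32 + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    *prev.last().unwrap()
}

/// A BK-tree node: children are keyed by their edit distance to this word.
struct BkNode {
    word: String,
    children: HashMap<u32, BkNode>,
}

impl BkNode {
    fn insert(&mut self, word: &str) {
        let d = levenshtein(&self.word, word);
        if d == 0 {
            return; // already present
        }
        match self.children.entry(d) {
            Entry::Occupied(mut e) => e.get_mut().insert(word),
            Entry::Vacant(e) => {
                e.insert(BkNode { word: word.to_string(), children: HashMap::new() });
            }
        }
    }

    /// Collect all words within `tolerance` edits of `query`,
    /// pruning subtrees via the triangle inequality.
    fn search(&self, query: &str, tolerance: u32, out: &mut Vec<String>) {
        let d = levenshtein(&self.word, query);
        if d <= tolerance {
            out.push(self.word.clone());
        }
        let lo = d.saturating_sub(tolerance);
        for (k, child) in &self.children {
            if *k >= lo && *k <= d + tolerance {
                child.search(query, tolerance, out);
            }
        }
    }
}

fn main() {
    let mut root = BkNode { word: "search".into(), children: HashMap::new() };
    for w in ["engine", "typo", "query", "quern"] {
        root.insert(w);
    }
    let mut matches = Vec::new();
    root.search("querry", 1, &mut matches);
    assert_eq!(matches, vec!["query".to_string()]);
}
```

A production version would also need to serialize this structure into Redis and layer a cache in front of it, as the article describes, but the pruned traversal above is what makes per-word correction cheap.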

- Trieve's typo correction system now operates at 300μs for correct queries and ~5ms for misspellings.

- The system uses ClickHouse for dictionary storage and BKTree for efficient typo correction.

- A cache layer in Redis reduces retrieval time for the BKTree during searches.

- Preliminary checks for English words prevent incorrect corrections of valid terms.

- Future plans include implementing query splitting and concatenation features.

3 comments
By @anonymoushn - 5 months
To do this fast path in about 1μs you could do this stuff:

- Remove the tries. If the detection must work exactly as it does now, you can store a small array of the prefixes and suffixes and just compare branchlessly against all of them. A comparison looks like a vector load, then movemask(cmpeq(prefix, needle) | cmpeq(prefix, all_zeros)) == scalar_all_ones in the case where your needle fits into a vector register (almost all words). edit: you can aggregate consecutive results of individual compares using cmov with the condition being this_mask == scalar_all_ones, but a few branches on this would also be fine. This requires that the prefixes be sorted in descending order by length, so that the first match you find will be the longest.

- Don't heap allocate strings several times during the query.

- Use a cheaper hash than siphash. A reasonable hash for short strings is 2 rounds of AES. ahash is an off-the-shelf hash which will do something like this.

- Use a hash table or hasher interface that lets you separate the hashing from the hash table lookups.

- Use a hash table that lets you prefetch the slot for a given hash. Compute your 3 hashes, prefetch the appropriate slots, then query the strings. If you get the chance to process more than 1 needle at a time, you can hide more latency here - the optimal stride for prefetching ahead of your queries into a large hash table is significantly more than 3.

- stuff about compound words omitted for brevity, but the approach is similar to the above.
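The first bullet's masked-compare trick can be illustrated without intrinsics. Below is a scalar sketch of the same predicate the comment describes: each prefix is zero-padded into a fixed 16-byte block (a stand-in for a vector register), and a lane matches when the prefix byte equals the needle byte or is a zero pad byte, which is exactly the `cmpeq(prefix, needle) | cmpeq(prefix, all_zeros)` mask. The affix table here is hypothetical; a real SIMD version would use the `cmpeq`/`movemask` intrinsics the comment names.

```rust
/// Zero-pad a short string into a fixed 16-byte block
/// (stand-in for a 128-bit vector register).
fn pad16(s: &str) -> [u8; 16] {
    let mut block = [0u8; 16];
    block[..s.len()].copy_from_slice(s.as_bytes());
    block
}

/// Scalar analogue of movemask(cmpeq(prefix, needle) | cmpeq(prefix, zeros)) == all_ones:
/// every prefix byte must equal the needle byte, or be a zero pad byte.
fn starts_with16(prefix: &[u8; 16], needle: &[u8; 16]) -> bool {
    let mut all = true;
    for i in 0..16 {
        // Non-short-circuiting `|` and `&=` keep this branchless per lane.
        all &= (prefix[i] == needle[i]) | (prefix[i] == 0);
    }
    all
}

/// The table is sorted longest-first, so the first hit is the longest match.
fn longest_prefix<'a>(table: &'a [(&'a str, [u8; 16])], word: &str) -> Option<&'a str> {
    let padded = pad16(word);
    table.iter().find(|(_, p)| starts_with16(p, &padded)).map(|(s, _)| *s)
}

fn main() {
    // Hypothetical affix table, sorted by descending length as the comment requires.
    let mut prefixes: Vec<&str> = vec!["un", "under", "re"];
    prefixes.sort_by_key(|p| std::cmp::Reverse(p.len()));
    let table: Vec<(&str, [u8; 16])> = prefixes.iter().map(|p| (*p, pad16(p))).collect();

    assert_eq!(longest_prefix(&table, "understand"), Some("under"));
    assert_eq!(longest_prefix(&table, "redo"), Some("re"));
    assert_eq!(longest_prefix(&table, "typo"), None);
}
```

The descending length sort matters: "understand" matches both "un" and "under", and only the ordering guarantees the longer affix wins.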

The document splitting for 38M hacker news posts also seems like it should complete in a minute or something. Not sure what's going on there.

By @underyx - 5 months
Title:

> 300μs typo correction for 1.3M words

Article:

> 300μs for correctly spelled queries and ~5ms/word for misspellings

Aren't these contradictory?

By @claytonwramsey - 5 months
It's a little odd that they're using `lazy_static` now that `OnceCell` and `LazyLock` are in the standard library.