300μs typo correction for 1.3M words
Trieve has overhauled the typo correction in its Hacker News search engine, reaching roughly 300 microseconds for correctly spelled queries and about 5 milliseconds per word for misspellings, using ClickHouse for dictionary storage, a BK-tree for correction, and Redis for caching.
Trieve has significantly improved the typo correction system for its Hacker News search engine, cutting response time from over 30 milliseconds to roughly 300 microseconds for correctly spelled queries and about 5 milliseconds per word for misspellings. The initial version was too slow for practical use, prompting a complete overhaul. The new system combines a dictionary built from 38 million posts, stored in ClickHouse for efficient access, with a Burkhard-Keller tree (BK-tree) for rapid typo correction. The BK-tree is constructed by processing chunks of text and is stored in Redis for quick retrieval, with a cache layer to minimize the time taken to access it during searches. A preliminary English word identification step ensures that legitimate words are not incorrectly flagged as misspellings, and affix analysis handles words with prefixes and suffixes. Future enhancements may include query splitting and concatenation, leveraging the same dictionary lookup capabilities. Overall, the improvements have led to a more responsive and accurate search experience.
- Trieve's typo correction system now operates at 300μs for correct queries and ~5ms for misspellings.
- The system uses ClickHouse for dictionary storage and BKTree for efficient typo correction.
- A cache layer in Redis reduces retrieval time for the BKTree during searches.
- Preliminary checks for English words prevent incorrect corrections of valid terms.
- Future plans include implementing query splitting and concatenation features.
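The Burkhard-Keller tree at the core of the system can be sketched in a few dozen lines. This is a generic illustration of the data structure, not Trieve's implementation; the node layout, the Levenshtein helper, and the tolerance parameter are all illustrative. The key property is the triangle inequality on edit distance, which lets a lookup prune whole subtrees whose distance-to-parent falls outside the query's tolerance window:

```rust
use std::collections::HashMap;

// Levenshtein edit distance between two words (two-row DP).
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

// A BK-tree node: a word plus children keyed by edit distance to it.
struct BkNode {
    word: String,
    children: HashMap<usize, BkNode>,
}

impl BkNode {
    fn new(word: &str) -> Self {
        BkNode { word: word.to_string(), children: HashMap::new() }
    }
}

struct BkTree {
    root: Option<BkNode>,
}

impl BkTree {
    fn new() -> Self {
        BkTree { root: None }
    }

    fn insert(&mut self, word: &str) {
        if self.root.is_none() {
            self.root = Some(BkNode::new(word));
            return;
        }
        let mut cur = self.root.as_mut().unwrap();
        loop {
            let d = edit_distance(word, &cur.word);
            if d == 0 {
                return; // word already in the tree
            }
            if cur.children.contains_key(&d) {
                // Descend along the child at the same distance.
                cur = cur.children.get_mut(&d).unwrap();
            } else {
                cur.children.insert(d, BkNode::new(word));
                return;
            }
        }
    }

    // All words within `tol` edits of `query`. The triangle inequality
    // means only children keyed in [d - tol, d + tol] can contain hits.
    fn search(&self, query: &str, tol: usize) -> Vec<String> {
        let mut out = Vec::new();
        let mut stack: Vec<&BkNode> = self.root.iter().collect();
        while let Some(node) = stack.pop() {
            let d = edit_distance(query, &node.word);
            if d <= tol {
                out.push(node.word.clone());
            }
            for (&k, child) in &node.children {
                if k + tol >= d && k <= d + tol {
                    stack.push(child);
                }
            }
        }
        out
    }
}
```

In a pipeline like the one described, the tree would be built once from the dictionary, serialized into Redis, and held in an in-process cache so that the per-query cost is just the pruned traversal.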
Related
Evaluating a Decade of Hacker News Predictions: An Open-Source Approach
The blog post evaluates a decade of Hacker News predictions using LLMs and ClickHouse. Results show a 50% success rate, highlighting challenges in prediction nuances. Future plans include expanding the project. Website: https://hn-predictions.eamag.me/.
turbopuffer: Fast Search on Object Storage
Simon Hørup Eskildsen founded turbopuffer in 2023 to offer a cost-efficient search engine using object storage and SSD caching. Notable customers experienced 10x cost reduction and improved latency. Application-based access.
History of Hacker News Search from 2007 to 2024
The evolution of Hacker News search from 2007 to 2024 includes three generations, with early limitations, the rise of HNSearch, and Algolia's improvements, leading to ongoing innovations with new technologies.
Searching a Codebase in English
Greptile is developing an AI system to improve semantic search in codebases, finding that translating code to natural language and using tighter chunking enhances search accuracy and retrieval quality.
Tbsp – treesitter-based source processing language
tbsp, a tree-based source-processing language, has recently added lists, index expressions, and string manipulation features, improved documentation, and introduced a command-line interface, following its renaming from "trawk."
- Remove the tries. If the detection must work exactly as it does now, you can store a small array of the prefixes and suffixes and just compare branchlessly against all of them. A comparison looks like a vector load, then movemask(cmpeq(prefix, needle) | cmpeq(prefix, all_zeros)) == scalar_all_ones in the case where your needle fits into a vector register (almost all words). edit: you can aggregate consecutive results of individual compares using cmov with the condition being this_mask == scalar_all_ones, but a few branches on this would also be fine. This requires that the prefixes be sorted in descending order by length, so that the first match you find will be the longest.
- Don't heap allocate strings several times during the query.
- Use a cheaper hash than siphash. A reasonable hash for short strings is 2 rounds of AES. ahash is an off-the-shelf hash which will do something like this.
- Use a hash table or hasher interface that lets you separate the hashing from the hash table lookups.
- Use a hash table that lets you prefetch the slot for a given hash. Compute your 3 hashes, prefetch the appropriate slots, then query the strings. If you get the chance to process more than 1 needle at a time, you can hide more latency here - the optimal stride for prefetching ahead of your queries into a large hash table is significantly more than 3.
- stuff about compound words omitted for brevity, but the approach is similar to the above.
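The branchless affix check in the first suggestion can be sketched portably. This scalar version mirrors the movemask(cmpeq(prefix, needle) | cmpeq(prefix, zeros)) idea with zero-padded fixed-width lanes: a lane passes if the prefix byte matches the needle byte or is a zero pad byte, and a prefix matches only if every lane passes. A compiler can often autovectorize a fixed-size loop like this into the SSE compare the comment describes; the table contents and function names here are illustrative:

```rust
const W: usize = 16; // one SSE register's worth of bytes

// Zero-pad a short string into a fixed 16-byte lane array.
fn pack(s: &str) -> [u8; W] {
    let mut v = [0u8; W];
    v[..s.len()].copy_from_slice(s.as_bytes());
    v
}

// Branchless "needle starts with prefix": every lane either matches
// the needle byte or is a zero pad byte past the prefix's end.
fn starts_with_packed(prefix: &[u8; W], needle: &[u8; W]) -> bool {
    let mut ok = true;
    for i in 0..W {
        ok &= (prefix[i] == needle[i]) | (prefix[i] == 0);
    }
    ok
}

// Longest-prefix match over a small table. As the comment notes, the
// table must be sorted by descending prefix length so the first match
// found is the longest one.
fn longest_prefix<'a>(table: &'a [(&'a str, [u8; W])], word: &str) -> Option<&'a str> {
    if word.len() > W {
        return None; // longer words would need a scalar fallback
    }
    let needle = pack(word);
    table
        .iter()
        .find(|(_, p)| starts_with_packed(p, &needle))
        .map(|(s, _)| *s)
}
```

The same packed representation works for suffixes by reversing the bytes before packing, and iterating a handful of 16-byte compares is cheap enough that the trie (and its pointer chasing) becomes unnecessary.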
The document splitting for 38M hacker news posts also seems like it should complete in a minute or something. Not sure what's going on there.
> 300μs typo correction for 1.3M words
Article:
> 300μs for correctly spelled queries and ~5ms/word for misspellings
Aren't these contradictory?