September 30th, 2024

Phrase Matching in Marginalia Search

Marginalia Search has updated its engine to support phrase matching, improving accuracy by storing exact term positions. The work, funded by NLnet, also improves the indexing and ranking algorithms.


Marginalia Search has added phrase matching to its search engine, which makes results for quoted queries more accurate: the engine can now return documents in which the search terms appear in exactly the order given in the query. The previous system stored term positions only approximately, which often led to inaccurate results. The new implementation stores the exact position of each term in a compressed format. NLnet funded the work, continuing its support after an initial grant period.

Position data is stored with varint encoding, which proved faster and more CPU-friendly than the gamma coding used before. The search engine also captures more detailed information about word occurrences, which improves indexing of documents, including those with code blocks. The ranking algorithm now takes new factors into account, such as whether the search phrase is present and how close the keywords are to each other, which makes results more relevant.
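To illustrate why exact term positions make phrase matching possible, here is a minimal Python sketch. It is not Marginalia's implementation; the function names `build_positions` and `phrase_match`, the whitespace tokenizer, and the in-memory dictionary are all assumptions made for the example.

```python
from collections import defaultdict


def build_positions(text):
    """Map each lowercased term to the list of word positions where it occurs.

    Hypothetical helper: a real index would tokenize far more carefully.
    """
    positions = defaultdict(list)
    for pos, term in enumerate(text.lower().split()):
        positions[term].append(pos)
    return positions


def phrase_match(positions, phrase):
    """Return True if the phrase's terms occur consecutively and in order."""
    terms = phrase.lower().split()
    if not terms or any(term not in positions for term in terms):
        return False
    # Shift each term's positions back by its offset within the phrase.
    # A phrase match is a start position shared by every shifted set.
    starts = set(positions[terms[0]])
    for offset, term in enumerate(terms[1:], start=1):
        starts &= {p - offset for p in positions[term]}
        if not starts:
            return False
    return True


doc = build_positions("ORM is the vietnam of computer science")
in_order = phrase_match(doc, "vietnam of computer science")
out_of_order = phrase_match(doc, "computer science vietnam")
```

With exact positions, the in-order phrase matches while the same words in a different order do not. An approximate position scheme cannot make that distinction reliably.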

- Marginalia Search now supports phrase matching for more accurate search results.

- The update was funded by NLnet and took four months to implement.

- The new system uses varint encoding for efficient storage of term positions.

- Enhanced ranking algorithms consider keyword proximity and document context.

- The search engine can now index more detailed information, improving overall search quality.
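The varint encoding mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the engine's actual on-disk format: it assumes positions are sorted and stored as gaps (delta encoding), and it uses the common layout of 7 payload bits per byte with the high bit as a continuation flag. The names `encode_positions` and `decode_positions` are hypothetical.

```python
def encode_positions(positions):
    """Encode sorted, non-negative positions as varint-encoded gaps.

    Assumption: gaps between consecutive positions are stored rather than
    absolute values, so most entries fit in a single byte.
    """
    out = bytearray()
    prev = 0
    for pos in positions:
        delta = pos - prev
        prev = pos
        # Emit 7 bits at a time, lowest bits first; a set high bit
        # means another byte follows for this value.
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_positions(data):
    """Invert encode_positions, returning the original position list."""
    positions = []
    value = 0
    shift = 0
    prev = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation bit set: keep accumulating
        else:
            prev += value  # gap complete: turn it back into a position
            positions.append(prev)
            value = 0
            shift = 0
    return positions


original = [3, 7, 150, 20000]
encoded = encode_positions(original)
decoded = decode_positions(encoded)
```

Decoding works on whole bytes with a few shifts and masks, whereas gamma coding requires bit-level reads. That is one plausible reason for the speed difference the summary describes.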

BM42 – a new baseline for hybrid search

Qdrant introduces BM42, combining BM25 with embeddings to enhance text retrieval. Addressing SPLADE's limitations, it leverages transformer models for semantic information extraction, promising improved retrieval quality and adaptability across domains.

How we improved search results in 1Password

1Password has improved its search functionality by integrating large language models, enhancing accuracy and flexibility while ensuring user privacy. The update retains original search options for user preference.

Launch HN: Undermind (YC S24) – AI agent for discovering scientific papers

Josh and Tom are developing Undermind, a search engine for complex scientific research, using large language models to enhance search accuracy and comprehensiveness, inviting user feedback for improvements.

BMX: A Freshly Baked Take on BM25

Researchers have developed BMX, a new lexical search algorithm that enhances BM25 by integrating similarity and semantic understanding. Extensive tests show BMX outperforms BM25 across various datasets and languages.

What if GitHub had vector search?

GitHub is enhancing its search functionality by integrating Manticore Search for semantic capabilities, improving accuracy and relevance through vector search, and planning a hybrid model for better user experience.

8 comments

By @senkora - 4 months

> turned up nothing but vietnamese computer scientists, and nothing about the famous blog post “ORM is the vietnam of computer science”. [emphasis added]

This points in the direction of the kinds of queries that I tend to use with Marginalia. I've found it to be very helpful in finding well-written blog posts about a variety of subjects, not just technical. I tend to use Marginalia when I am in the mood to find and read such articles.

This is also largely the same reason that I read HN. My current approach is to 1) read HN on a regular schedule, 2) search Marginalia if there is a specific topic that I want, and then 3) add interesting blogs from either to my RSS reader app.

By @ColinHayhurst - 4 months

Congrats Viktor.

> The feedback cycle in web search engine development is very long....Overall the approach taken to improving search result quality is looking at a query that does not give good results, asking what needs to change for that to improve, and then making that change. Sometimes it’s a small improvement, sometimes it’s a huge game changer.

Yes, this resonates with our experience

By @mdaniel - 4 months

Based solely upon the title and the first commit's date, I'm guessing it's this: https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99

By @pmdulaney - 4 months

Amazing! "bicycle touring in France" as a search target produces a huge number of spot-on returns beautifully formatted.

By @gary_0 - 4 months

> To make the most of phrase matching, stop words need to go.

Perhaps I am misunderstanding; does this mean occurrences of stop words like "the" are stored now instead of ignored? That seems like it would add a lot of bloat. Are there any optimizations in place?

Just a shot-in-the-dark suggestion, but if you are storing some bits with each keyword occurrence, can you add a few more bits to store whether the term is adjacent to a common stop word? So maybe if you have to=0 or=1, "to be or not to be" would be able to match the data `0be 1not 0be`, where only "be" and "not" are actual keywords. But the extra metadata bits can be ignored, so pages containing "The Clash" will match both the literal query (via the "the" bit), and just "clash" (without the "the" bit).

By @efilife - 4 months

I wrote my own search engine some time ago and was impressed by how well it worked on my relatively small index. And then I see this. Marginalia's dev is just unmatched in the persistence and knowledge it takes to pull all of this off. I wouldn't even know where to start with some of the things he did with his search engine.

By @hosteur - 4 months

Always nice to see updates on marginalia.

By @arromatic - 4 months

1. Is the index public? 2. Any chance of an RSS feed search?
