September 30th, 2024

Phrase Matching in Marginalia Search

Marginalia Search has updated its engine to support phrase matching, improving accuracy by storing exact term positions. The work, funded by NLnet, also improves the indexing and ranking algorithms for better search results.

Marginalia Search has introduced a significant update to its search engine: phrase matching, which improves the accuracy of results for quoted queries. The engine can now return results where the search terms appear in the exact order specified in the query. The previous system stored term positions only approximately, which often led to inaccurate results. The new implementation stores the exact positions of terms in a compressed format, improving precision. The update was funded by NLnet, which supported the project after an initial grant period.

The new system uses varint encoding to store position data, which proved faster and more CPU-friendly than the previous gamma coding method. The search engine also captures more detailed information about word occurrences, allowing for better indexing of documents, including those with code blocks. The ranking algorithm has been refined to incorporate new factors, such as the presence of search phrases and the proximity of keywords, which improves the relevance of search results. Overall, these improvements aim to give users more accurate and contextually relevant results.
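To make the idea of exact term positions concrete, here is a minimal sketch of phrase matching over a positional index. It is an illustration only, not Marginalia's implementation: the function names `build_index` and `phrase_match`, the whitespace tokenizer, and the in-memory dictionary layout are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]}.

    Assumption: naive lowercase whitespace tokenization, purely for illustration.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return sorted doc ids in which the phrase terms occur consecutively, in order."""
    terms = phrase.lower().split()
    if not terms:
        return []

    # A matching document must contain every term of the phrase.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))

    hits = []
    for doc_id in candidates:
        # Candidate start positions: every occurrence of the first term.
        starts = set(index[terms[0]][doc_id])
        # Keep a start only if term k also occurs at position start + k.
        for offset, term in enumerate(terms[1:], start=1):
            starts &= {p - offset for p in index[term][doc_id]}
        if starts:
            hits.append(doc_id)
    return sorted(hits)


docs = {
    1: "ORM is the vietnam of computer science",
    2: "vietnamese computer scientists of science",
}
index = build_index(docs)
matches = phrase_match(index, "vietnam of computer science")
```

With approximate positions, the intersection step above cannot tell whether two terms are truly adjacent, which is why exact positions matter for quoted queries.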

- Marginalia Search now supports phrase matching for more accurate search results.

- The update was funded by NLnet and took four months to implement.

- The new system uses varint encoding for efficient storage of term positions.

- Enhanced ranking algorithms consider keyword proximity and document context.

- The search engine can now index more detailed information, improving overall search quality.
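As a rough illustration of the varint storage mentioned above, the sketch below encodes a sorted list of term positions as variable-length integers. The summary only says that varint encoding is used; storing the gaps between positions (delta encoding), the 7-bits-per-byte layout with a continuation bit, and the names `encode_positions` and `decode_positions` are assumptions for this example and may differ from what Marginalia actually does.

```python
def encode_positions(positions):
    """Encode ascending, non-negative positions as varint-coded gaps.

    Assumption: each byte carries 7 payload bits, and the high bit
    signals that more bytes follow for the same value.
    """
    out = bytearray()
    prev = 0
    for pos in positions:
        delta = pos - prev  # gap from the previous position
        prev = pos
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            delta >>= 7
        out.append(delta)  # final byte, continuation bit clear
    return bytes(out)


def decode_positions(data):
    """Invert encode_positions, returning the original list of positions."""
    positions = []
    prev = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes belong to this value
        else:
            prev += value  # gap is complete; turn it back into a position
            positions.append(prev)
            value = 0
            shift = 0
    return positions


encoded = encode_positions([3, 7, 300])
decoded = decode_positions(encoded)
```

Small gaps fit in a single byte, so positions of frequent terms stay compact, and decoding works a whole byte at a time rather than bit by bit.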

8 comments
By @senkora - 4 months
> turned up nothing but vietnamese computer scientists, and nothing about the famous blog post “ORM is the vietnam of computer science”. [emphasis added]

This points in the direction of the kinds of queries that I tend to use with Marginalia. I've found it to be very helpful in finding well-written blog posts about a variety of subjects, not just technical. I tend to use Marginalia when I am in the mood to find and read such articles.

This is also largely the same reason that I read HN. My current approach is to 1) read HN on a regular schedule, 2) search Marginalia if there is a specific topic that I want, and then 3) add interesting blogs from either to my RSS reader app.

By @ColinHayhurst - 4 months
Congrats Viktor.

> The feedback cycle in web search engine development is very long....Overall the approach taken to improving search result quality is looking at a query that does not give good results, asking what needs to change for that to improve, and then making that change. Sometimes it’s a small improvement, sometimes it’s a huge game changer.

Yes, this resonates with our experience.

By @mdaniel - 4 months
Based solely upon the title and the first commit's date, I'm guessing it's this: https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99
By @pmdulaney - 4 months
Amazing! "bicycle touring in France" as a search target produces a huge number of spot-on returns beautifully formatted.
By @gary_0 - 4 months
> To make the most of phrase matching, stop words need to go.

Perhaps I am misunderstanding; does this mean occurrences of stop words like "the" are stored now instead of ignored? That seems like it would add a lot of bloat. Are there any optimizations in place?

Just a shot-in-the-dark suggestion, but if you are storing some bits with each keyword occurrence, can you add a few more bits to store whether the term is adjacent to a common stop word? So maybe if you have to=0 or=1, "to be or not to be" would be able to match the data `0be 1not 0be`, where only "be" and "not" are actual keywords. But the extra metadata bits can be ignored, so pages containing "The Clash" will match both the literal query (via the "the" bit), and just "clash" (without the "the" bit).

By @efilife - 4 months
I wrote my own search engine some time ago and was impressed by how well it worked on my relatively small index. And then I see this. Marginalia's dev is just unmatched in the persistence and knowledge it takes to pull all of this off. I wouldn't even know where to start with some of the things he did with his search engine.
By @hosteur - 4 months
Always nice to see updates on Marginalia.
By @arromatic - 4 months
1. Is the index public? 2. Any chance of an RSS feed search?