July 27th, 2024

Show HN: Semantic Grep – A Word2Vec-powered search tool

sgrep is a command-line tool for semantic search using word embeddings, extending grep-style functionality. It supports context display and configurable similarity thresholds, and it requires a Word2Vec model to run.


sgrep is a command-line utility for semantic search using word embeddings, extending traditional grep functionality. Instead of exact string matches, it finds words semantically similar to a given query and reports them with context and line numbers. For example, to find words similar to "death" in Hemingway's "The Old Man and the Sea," a user can fetch the text and pipe it through sgrep, which displays matches with surrounding context. Key features include semantic search with Word2Vec embeddings, a configurable similarity threshold, context display, color-coded output, and support for reading from files or standard input.

Installation is done by downloading a binary release or building from source; the README provides commands for cloning the repository and building the tool. Command-line options let users specify the Word2Vec model path, set the similarity threshold, and control the number of context lines displayed. Configuration is also possible through a JSON file, named config.json by default, where users can specify the model path.
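For illustration, a minimal config.json along those lines might look like the following (the key name and path are assumptions based on this description, not taken from the repository):

    {
        "model_path": "path/to/word2vec-model.bin"
    }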

sgrep requires a Word2Vec model in binary format, with options to use pre-trained models or train custom ones. The repository includes a script to download a slim version of the Google Word2Vec model. Contributions to the project are encouraged, and it is licensed under the MIT License. More information can be found on the sgrep GitHub repository.

AI: What people are saying
The comments on the sgrep tool reveal a mix of interest and suggestions for improvement.
  • Users express excitement about the potential of semantic search and its applications.
  • Several commenters suggest performance enhancements, such as multi-CPU support and faster similarity computations using BLAS.
  • There are inquiries about the tool's compatibility with different languages and models, as well as concerns about the availability of Word2Vec models.
  • Some users share their experiences and challenges with existing tools, highlighting the need for better configuration options.
  • Overall, the community shows a strong interest in exploring and enhancing the capabilities of sgrep.
23 comments
By @danieldk - 3 months
Some small tips from superficially reading the code:

https://github.com/arunsupe/semantic-grep/blob/b7dcc82a7cbab...

You can read the vector all at once. See e.g.:

https://github.com/danieldk/go2vec/blob/ee0e8720a8f518315f35...

---

https://github.com/arunsupe/semantic-grep/blob/b7dcc82a7cbab...

You can compute the similarity much faster by using BLAS. Good BLAS libraries have SIMD-optimized implementations. Or if you do multiple tokens at once, you can do a matrix-vector multiplication (sgemv), which will be even faster in many implementations. Alternatively, there is probably also a SIMD implementation in Go using assembly (it has been 7 years since I looked at anything in the Go ecosystem).
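A rough sketch of the matrix-vector idea in Go using gonum (an assumption for illustration; the project itself does not necessarily use gonum). One multiply scores every token against the query at once:

    package main

    import (
        "fmt"

        "gonum.org/v1/gonum/mat"
    )

    func main() {
        d := 3 // toy embedding dimension
        // One (already normalized) token embedding per row.
        tokens := mat.NewDense(2, d, []float64{
            0.6, 0.8, 0.0,
            0.0, 0.6, 0.8,
        })
        query := mat.NewVecDense(d, []float64{0.6, 0.8, 0.0})

        // A single matrix-vector multiply (an sgemv-style call)
        // computes the score of every token against the query.
        var scores mat.VecDense
        scores.MulVec(tokens, query)
        fmt.Println(mat.Formatted(&scores))
    }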

You could also normalize the vectors while loading. Then during runtime the cosine similarity is just the dot product of the vectors (whether it pays off depends on the size of your embedding matrix and the size of the haystack that you are going to search).
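A minimal sketch of that normalize-once trick (not the project's actual code; names and types are assumed):

    package vecutil

    import "math"

    // normalize scales v to unit length in place, so that cosine
    // similarity later reduces to a plain dot product.
    func normalize(v []float32) {
        var sum float64
        for _, x := range v {
            sum += float64(x) * float64(x)
        }
        if sum == 0 {
            return
        }
        inv := float32(1 / math.Sqrt(sum))
        for i := range v {
            v[i] *= inv
        }
    }

    // dot is the cosine similarity of two already-normalized vectors.
    func dot(a, b []float32) float32 {
        var s float32
        for i := range a {
            s += a[i] * b[i]
        }
        return s
    }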

By @onli - 3 months
That's totally clever and sounds really useful. And it's one of those ideas where you go "Why didn't I think of that?" when stumbling over the materials, word2vec in this case.
By @samatman - 3 months
This is a good idea. I'm going to offer some unsolicited feedback here:

The configuration thing is unclear to me. I think that "current directory" means "same directory as the binary", but it could mean pwd.

Neither of those is good: configuration doesn't belong where the binaries go, and it's obviously wrong to look for configs in the working directory.

I suggest checking $XDG_CONFIG_HOME, and defaulting to `~/.config/sgrep/config.toml`.

That extension is not a typo, btw. JSON is unpleasant to edit for configuration purposes; TOML is not.

Or you could use an environment variable directly; if the only thing that needs configuring is the model's location, that would be fine as well.

If that were the on ramp, I'd be giving feedback on the program instead. I do think it's a clever idea and I'd like to try it out.

By @throw156754228 - 3 months
The model of a word to a vector breaks down really quickly once you introduce the context and complexity of human language. That's why we moved to contextual embeddings, but even they have issues.

I'm curious: would it handle negation of trained keywords, e.g. "not urgent"?

By @_flux - 3 months
I wonder if it would be possible to easily add support for multiple CPUs? It seems to be using at most 150% CPU, so on my workstation it could be (assuming high parallelism) up to 10 times as fast.
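For what it's worth, a hypothetical sketch of how per-line scoring could be fanned out across all CPUs with goroutines (scoreLine is a stand-in for whatever similarity the tool actually computes, not its real API):

    package parallel

    import (
        "runtime"
        "sync"
    )

    // scoreLine is a placeholder for the tool's per-line similarity.
    func scoreLine(line string, query []float32) float32 {
        _ = line
        _ = query
        return 0
    }

    // scoreLines spreads the work over one worker per CPU.
    func scoreLines(lines []string, query []float32) []float32 {
        scores := make([]float32, len(lines))
        jobs := make(chan int)
        var wg sync.WaitGroup
        for w := 0; w < runtime.NumCPU(); w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for i := range jobs {
                    scores[i] = scoreLine(lines[i], query)
                }
            }()
        }
        for i := range lines {
            jobs <- i
        }
        close(jobs)
        wg.Wait()
        return scores
    }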

Alas the word2vec repository has reached its quota:

    fetch: Fetching reference refs/heads/master
    batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
    error: failed to fetch some objects from 'https://github.com/mmihaltz/word2vec-GoogleNews-vectors.git/info/lfs'

So here is another source I found for it: https://stackoverflow.com/a/43423646

I also found https://huggingface.co/fse/word2vec-google-news-300/tree/mai... but I'm unsure whether that's the correct format for this tool. The first source, from Google Drive, seems to work, and there's little chance of it being malicious.

By @randcraw - 3 months
This would be really useful if it could take a descriptive phrase or a compound phrase (like SQL 'select X and Y and Z') and match against the semantic cluster(s) that the query forms. IMO that's the greatest failing of today's search engines -- they're all one-hit wonders.
By @fzeindl - 3 months
FYI, there is already a widely used tool (and a company) called semgrep, whose name stems from "semantic grep": https://semgrep.dev/.
By @molli - 3 months
Similar: https://github.com/moritztng/fltr

Like grep but for natural language questions. Based on Mistral LLMs.

By @drdeca - 3 months
Very cool!

Do I understand correctly that this works by splitting each line into words, and using the embedding for each word?

I wonder whether it might be feasible to search by the semantics of longer sequences of text, using some language model (one of the smaller ones, like GPT-2-small or something?). That way, if you were searching for “die”, then “kick the bucket” and “buy the farm” could also match somehow. Though I’m not sure what vector you would use for the dot product when there is a sequence of tokens, each with associated key vectors for each head at each layer, rather than a single vector associated with a word. Maybe one of the encoder-decoder models rather than the decoder-only models?

Though, for things like grep, one probably wants things to be very fast and as lightweight as feasible, which I imagine is much more the case with word vectors (as you have here) than it would be using a whole transformer model to produce the vectors.

Maybe if one wanted to catch words that aren’t separated correctly, one could detect whether the line is composed of well-separated words, and if not, find all words that appear as substrings of that line? Though maybe that would be too slow?
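One cheap approximation in the direction of phrase matching, suggested here rather than taken from the tool: average the word vectors of a phrase into a single vector and match that like any other embedding. It loses word order, but stays fast:

    package pool

    // phraseVector averages per-word vectors into one phrase vector
    // (naive mean pooling; order-insensitive by construction).
    func phraseVector(wordVecs [][]float32, dim int) []float32 {
        out := make([]float32, dim)
        if len(wordVecs) == 0 {
            return out
        }
        for _, v := range wordVecs {
            for i, x := range v {
                out[i] += x
            }
        }
        n := float32(len(wordVecs))
        for i := range out {
            out[i] /= n
        }
        return out
    }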

By @rldjbpin - 3 months
document search, an LLM use case almost all companies want for their white-collar workers, currently means building a RAG pipeline.

vector search is the first step of that process, and i'd argue that returning the top n results and letting the user process them is probably the best of both worlds without involving "AI".
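A sketch of that top-n flow over lines that have already been scored (the types here are assumptions for illustration, not the tool's API):

    package topn

    import "sort"

    type match struct {
        line  string
        score float32
    }

    // topN returns the n best-scoring matches for the user to judge,
    // with no LLM in the loop. Note that it reorders the input slice.
    func topN(matches []match, n int) []match {
        sort.Slice(matches, func(i, j int) bool {
            return matches[i].score > matches[j].score
        })
        if len(matches) > n {
            matches = matches[:n]
        }
        return matches
    }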

this can be enhanced with multilingual support, depending on the encoder used. but building the index is still an expensive process, and i wonder how that could be done fast for a local user.

By @sam_perez - 3 months
This tool seems really cool, want to play with it for sure.

Just in general, semantic search across text seems like a much better UX for many applications. Wish it were more prevalent.

By @gunalx - 3 months
Really cool. Often I just want to fuzzy-search for a word, and this would be useful. Can it do filenames as well? Or do I need to pipe something like ls in first?
By @fbdab103 - 3 months
I might have a work use case for which this would be perfect.

Having no experience with word2vec, some reference performance numbers would be great. If I have one million PDF pages, how long is that going to take to encode? How long will it take to search? Is it CPU only or will I get a huge performance benefit if I have a GPU?

By @rasengan0 - 3 months
very cool, led me to find https://www.cs.helsinki.fi/u/jjaakkol/sgrep.html and semgrep is taken, so another symlink it is: w2vgrep?
By @infruset - 3 months
Very nice! How might one go about adapting this to other languages? Does a downloadable version of the model exist somewhere?
By @pgroves - 3 months
How fast is it?
By @piyushtechsavy - 3 months
Hello Arun. Does it use cosine similarity for the matching?
By @synergy20 - 3 months
just played around with it; not very smart per se. to get the full power of semantic grep, an LLM might still be needed. how would one do that? is RAG the only way?
By @sitkack - 3 months
Your post might have been flagged because of your example?
By @tgw43279w - 3 months
I really like how simple the implementation is!
By @shubham13596 - 3 months
Why not use GPT-4 embeddings instead of word2vec? Won't that be more effective?
By @low_tech_punk - 3 months
not to be confused with https://github.com/semgrep/semgrep
By @danielmarkbruce - 3 months
damn, great thinking.