Show HN: Semantic Grep – A Word2Vec-powered search tool
sgrep is a command-line tool for semantic searches using word embeddings, enhancing grep functionality. It supports context display, configurable thresholds, and requires a Word2Vec model for operation.
sgrep is a command-line utility designed for semantic search using word embeddings, extending traditional grep functionality. It lets users find words semantically similar to a given query, with context and line numbers for each match. For example, to search for words similar to "death" in Hemingway's "The Old Man and the Sea," users can pipe the fetched text into sgrep and display results with surrounding context. Key features include semantic search with Word2Vec embeddings, a configurable similarity threshold, context display, color-coded output, and support for reading from files or standard input.
Installation can be done by downloading the binary release or building from source, with specific commands provided for cloning the repository and building the tool. Command-line options allow users to specify the Word2Vec model path, set similarity thresholds, and control the number of context lines displayed. Configuration is possible through a JSON file, with a default named config.json, where users can specify the model path.
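For illustration, such a config file might look like the following. The exact key name (`model_path` here) is an assumption based on the description above; check the repository's README for the actual schema:

```json
{
  "model_path": "models/GoogleNews-vectors-negative300-slim.bin"
}
```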
sgrep requires a Word2Vec model in binary format, with options to use pre-trained models or train custom ones. The repository includes a script to download a slim version of the Google Word2Vec model. Contributions to the project are encouraged, and it is licensed under the MIT License. More information can be found on the sgrep GitHub repository.
Related
Surprise, your data warehouse can RAG
A blog post by Maciej Gryka explores "Retrieval-Augmented Generation" (RAG) to enhance AI systems. It discusses building RAG pipelines, using text embeddings for data retrieval, and optimizing data infrastructure for effective implementation.
The Windows Console gets support for Sixel images
The Microsoft Terminal GitHub repository features a spell checking configuration with files like allow/*.txt, reject.txt, excludes.txt, patterns/*.txt, candidate.patterns, line_forbidden.patterns, expect/*.txt, and advice.md for various purposes. Detailed formats available.
GraphRAG (from Microsoft) is now open-source!
GraphRAG, a GitHub tool, enhances question-answering over private datasets with structured retrieval and response generation. It outperforms naive RAG methods, offering semantic analysis and diverse, comprehensive data summaries efficiently.
Korvus: Single-Query RAG with Postgres
Korvus is a search SDK merging RAG pipeline into a Postgres query, using Python, JavaScript, and Rust bindings. It streamlines search processes, minimizes infrastructure needs, and offers detailed documentation on GitHub.
Bash-Oneliners: A collection of terminal tricks for Linux
The GitHub repository compiles Bash one-liners and commands for bioinformatics and cloud computing, covering terminal tricks, variable manipulation, text processing, networking commands, and system maintenance for improved command-line proficiency.
- Users express excitement about the potential of semantic search and its applications.
- Several commenters suggest performance enhancements, such as multi-CPU support and faster similarity computations using BLAS.
- There are inquiries about the tool's compatibility with different languages and models, as well as concerns about the availability of Word2Vec models.
- Some users share their experiences and challenges with existing tools, highlighting the need for better configuration options.
- Overall, the community shows a strong interest in exploring and enhancing the capabilities of sgrep.
https://github.com/arunsupe/semantic-grep/blob/b7dcc82a7cbab...
You can read the vector all at once. See e.g.:
https://github.com/danieldk/go2vec/blob/ee0e8720a8f518315f35...
---
https://github.com/arunsupe/semantic-grep/blob/b7dcc82a7cbab...
You can compute the similarity much faster by using BLAS. Good BLAS libraries have SIMD-optimized implementations. Or, if you process multiple tokens at once, you can do a matrix-vector multiplication (sgemv), which will be even faster in many implementations. Alternatively, there is probably also a SIMD implementation for Go using assembly (it has been 7 years since I looked at anything in the Go ecosystem).
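The matrix-vector idea can be sketched in a few lines. This is a pure-Python illustration with toy data, not the project's code; a real implementation would hand the same computation to BLAS `sgemv` (or a SIMD kernel), which performs the identical arithmetic with optimized inner loops:

```python
# Score many tokens against one query at once: stack the token
# embeddings into a matrix E (n rows, d columns) and compute E @ q
# in a single pass, instead of n separate dot-product calls.

def matvec(E, q):
    """Multiply matrix E (a list of rows) by vector q."""
    return [sum(e_i * q_i for e_i, q_i in zip(row, q)) for row in E]

# Toy embeddings: three tokens, 4-dimensional vectors.
E = [
    [1.0, 0.0, 0.0, 0.0],  # token "a"
    [0.0, 1.0, 0.0, 0.0],  # token "b"
    [0.5, 0.5, 0.0, 0.0],  # token "c"
]
q = [1.0, 0.0, 0.0, 0.0]   # query vector

scores = matvec(E, q)
print(scores)  # [1.0, 0.0, 0.5]
```

With BLAS, `matvec` becomes one `sgemv` call, and the per-token Python overhead disappears.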
You could also normalize the vectors while loading. Then during runtime the cosine similarity is just the dot product of the vectors (whether it pays off depends on the size of your embedding matrix and the size of the haystack that you are going to search).
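The normalize-at-load trick is easy to verify: once both vectors have unit length, their dot product equals their cosine similarity, so the division and square roots move out of the hot loop. A small sketch (toy vectors, not the tool's code):

```python
import math

def normalize(v):
    """Scale v to unit length, paying the sqrt cost once at load time."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Full cosine similarity, with per-query normalization."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

# After normalizing once at load time...
an, bn = normalize(a), normalize(b)
# ...the runtime similarity is a single dot product.
assert abs(dot(an, bn) - cosine(a, b)) < 1e-12
print(dot(an, bn))  # 0.96
```

As the comment notes, whether this pays off depends on how often each embedding is reused against the haystack.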
The configuration thing is unclear to me. I think that "current directory" means "same directory as the binary", but it could mean pwd.
Neither of those is good: configuration doesn't belong where the binaries go, and it's obviously wrong to look for configs in the working directory.
I suggest checking $XDG_CONFIG_HOME, and defaulting to `~/.config/sgrep/config.toml`.
That extension is not a typo, btw. JSON is unpleasant to edit for configuration purposes, TOML is not.
Or you could use an ENV variable directly, if the only thing that needs configuring is the model's location, that would be fine as well.
If that were the on-ramp, I'd be giving feedback on the program instead. I do think it's a clever idea and I'd like to try it out.
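The lookup order suggested above can be sketched in a few lines. This is a hypothetical illustration: the `SGREP_CONFIG` variable name is invented for the example, and only the XDG convention itself is standard:

```python
import os
from pathlib import Path

def find_config():
    """Resolve the config path in the order suggested above:
    1. an explicit environment variable (SGREP_CONFIG is a
       hypothetical name for illustration),
    2. $XDG_CONFIG_HOME/sgrep/config.toml,
    3. ~/.config/sgrep/config.toml as the XDG-specified fallback."""
    explicit = os.environ.get("SGREP_CONFIG")
    if explicit:
        return Path(explicit)
    xdg = os.environ.get("XDG_CONFIG_HOME", str(Path.home() / ".config"))
    return Path(xdg) / "sgrep" / "config.toml"
```

Neither the binary's directory nor the working directory appears anywhere in the chain, which is the point of the complaint above.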
Curious would it handle negation of trained keywords, e.g "not urgent"?
Alas, the word2vec repository has reached its quota:
fetch: Fetching reference refs/heads/master
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/mmihaltz/word2vec-GoogleNews-vectors.git/info/lfs'
So here are some other sources I found for it: https://stackoverflow.com/a/43423646 . I also found https://huggingface.co/fse/word2vec-google-news-300/tree/mai... but I'm unsure if that's the correct format for this tool. The first source, from Google Drive, seems to work, and there's little chance of it being malicious.
Like grep but for natural language questions. Based on Mistral LLMs.
Do I understand correctly that this works by splitting each line into words, and using the embedding for each word?
I wonder whether it might be feasible to search by the semantics of longer sequences of text, using some language model (one of the smaller ones, like GPT-2 small?). That way, if you were searching for "die", then "kick the bucket" and "buy the farm" could also match somehow. Though I'm not sure what vector you would use for the dot product when there is a sequence of tokens, each with associated key vectors for each head at each layer, rather than a single vector associated with a word. Maybe one of the encoder-decoder models rather than the decoder-only models?
Though, for things like grep, one probably wants things to be very fast and as lightweight as feasible, which I imagine is much more the case with word vectors (as you have here) than it would be using a whole transformer model to produce the vectors.
Maybe if one wanted to catch words that aren’t separated correctly, one could detect if the line isn’t comprised of well-separated words, and if so, find all words that appear as a substring of that line? Though maybe that would be too slow?
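If the tool does score lines word by word, as the question above supposes, the core loop might look roughly like this. This is a hypothetical sketch with a toy embedding table, not the project's actual code; a real tool loads the vectors from a Word2Vec binary:

```python
import math

# Toy embedding table standing in for a loaded Word2Vec model.
EMB = {
    "death": [1.0, 0.0],
    "dying": [0.9, 0.1],
    "fish":  [0.0, 1.0],
}

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / den

def line_matches(line, query, threshold=0.7):
    """A line matches if any of its words embeds close enough to the query."""
    q = EMB.get(query)
    if q is None:
        return False
    return any(
        cosine(EMB[w], q) >= threshold
        for w in line.lower().split()
        if w in EMB
    )

print(line_matches("the old man lay dying", "death"))  # True
print(line_matches("a big fish", "death"))             # False
```

Note the `line.lower().split()` step: this is exactly where badly separated words would slip through, as the comment above suspects.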
Vector search is the first step of the process, and I'd argue that getting the top n results and letting the user process them is probably the best of both worlds without involving "AI".
This can be enhanced with multilingual support, depending on the encoder used. But building the index is still an expensive process, and I wonder how that could be done fast for a local user.
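The top-n step itself is cheap to express once similarity scores exist (the expensive part, as noted, is building the index that produces them). A minimal sketch with precomputed scores:

```python
import heapq

# Precomputed (score, line) pairs; in practice the scores come from
# the embedding comparison, which is where the real cost lives.
scored = [
    (0.91, "he thought of death at sea"),
    (0.35, "the boy brought coffee"),
    (0.78, "the fish was dying slowly"),
    (0.12, "he rolled up his trousers"),
]

# Keep only the n best matches and let the user judge them.
top = heapq.nlargest(2, scored)
print(top)  # [(0.91, 'he thought of death at sea'), (0.78, 'the fish was dying slowly')]
```

`heapq.nlargest` runs in O(len(scored) * log n), so it stays cheap even over large result sets.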
Just in general, semantic search across text seems like a much better UX for many applications. Wish it were more prevalent.
Having no experience with word2vec, some reference performance numbers would be great. If I have one million PDF pages, how long is that going to take to encode? How long will it take to search? Is it CPU only or will I get a huge performance benefit if I have a GPU?