September 2nd, 2024

Understanding Pgvector's HNSW Index Storage in Postgres

The article analyzes the HNSW index in pgvector for PostgreSQL, detailing its structure, metadata, optimizations for space efficiency, and a C parser that converts the index into JSON for visualization.

The article analyzes how pgvector stores its HNSW (Hierarchical Navigable Small World) index in PostgreSQL. It opens with a brief overview of how PostgreSQL organizes data on disk, focusing on page structure and the ItemIDs that locate each item within a page. The HNSW index consists of a metadata page followed by index pages; the metadata page holds configuration details such as the vector dimensions, connection limits, and entry points for the HNSW graph. The article then describes the element tuples and neighbor info tuples, which store the graph nodes and their connections, and highlights a storage optimization: duplicate entries share a single element tuple to save space.

The article also visualizes index pages and maps hex dumps onto the corresponding data structures to show how the index works internally. It mentions a parser written in C that converts the HNSW index into JSON for easier inspection and visualization, and it concludes by demonstrating how the index behaves during insertions and deletions, including how space is managed.
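
To make the page and ItemID layout concrete, here is a minimal Python sketch that builds a synthetic 8 kB page and reads it back. It follows the standard PostgreSQL page layout (a 24-byte page header followed by 4-byte item IDs that point at items stored from the end of the page), but it is not the article's C parser. It assumes a little-endian machine, ignores tuple alignment, and the helper names `build_demo_page` and `parse_page` are hypothetical.

```python
import struct

PAGE_SIZE = 8192
# PageHeaderData: pd_lsn (2 x uint32), pd_checksum, pd_flags, pd_lower,
# pd_upper, pd_special, pd_pagesize_version (uint16 each), pd_prune_xid (uint32)
HEADER_FMT = "<IIHHHHHHI"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 24 bytes
LP_NORMAL = 1  # item ID flag value for an in-use item


def build_demo_page(payloads, special_size=0):
    """Hypothetical helper: lay out payloads the way a page stores items.

    Item IDs grow forward from the header; item data grows backward from
    the end of the page. Alignment padding is ignored for simplicity.
    """
    page = bytearray(PAGE_SIZE)
    lower = HEADER_SIZE
    upper = PAGE_SIZE - special_size
    for payload in payloads:
        upper -= len(payload)
        page[upper:upper + len(payload)] = payload
        # ItemIdData bitfield (little-endian): 15-bit offset, 2-bit flags, 15-bit length
        raw = upper | (LP_NORMAL << 15) | (len(payload) << 17)
        struct.pack_into("<I", page, lower, raw)
        lower += 4
    struct.pack_into(HEADER_FMT, page, 0, 0, 0, 0, 0, lower, upper,
                     PAGE_SIZE - special_size, PAGE_SIZE | 4, 0)
    return bytes(page)


def parse_page(page):
    """Read the header, then decode every item ID between the header and pd_lower."""
    fields = struct.unpack_from(HEADER_FMT, page, 0)
    lower, upper, special = fields[4], fields[5], fields[6]
    items = []
    for pos in range(HEADER_SIZE, lower, 4):
        (raw,) = struct.unpack_from("<I", page, pos)
        items.append({
            "offset": raw & 0x7FFF,
            "flags": (raw >> 15) & 0x3,
            "length": raw >> 17,
        })
    return {"lower": lower, "upper": upper, "special": special, "items": items}


demo = build_demo_page([b"element", b"neighbors!"])
parsed = parse_page(demo)
for item in parsed["items"]:
    print(item, demo[item["offset"]:item["offset"] + item["length"]])
```

On a real index, the bytes would come from the index's relation file rather than from `build_demo_page`, and each item would be an HNSW element or neighbor tuple instead of a placeholder string.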

- The HNSW index in pgvector is structured into metadata and index pages.

- Metadata includes configuration parameters essential for managing the HNSW graph.

- Element and neighbor info tuples store data about graph nodes and their connections.

- Optimizations allow for shared storage of duplicate entries to save space.

- A C-based parser converts the index into JSON for visualization and understanding.
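
The duplicate-sharing optimization can be illustrated with a toy model. The sketch below is not pgvector's code: `Element`, `ToyIndex`, and `neighbor_slots` are hypothetical names, the limit of 10 heap row pointers per element is an assumption about pgvector's default, and the linear scan for duplicates stands in for the graph search a real index would use. The neighbor slot formula reflects the usual HNSW convention of allowing `2 * m` connections on layer 0 and `m` on each higher layer.

```python
from dataclasses import dataclass, field

MAX_HEAPTIDS = 10  # assumption: heap row pointers one element tuple can hold


@dataclass
class Element:
    """Toy stand-in for an element tuple: one vector, many heap row pointers."""
    vector: tuple
    level: int
    heaptids: list = field(default_factory=list)


def neighbor_slots(level, m):
    """Connection slots for a node whose top layer is `level`.

    Layer 0 gets 2 * m slots and each of the `level` higher layers gets m,
    which adds up to (level + 2) * m.
    """
    return (level + 2) * m


class ToyIndex:
    def __init__(self):
        self.elements = []

    def insert(self, vector, heaptid, level=0):
        vector = tuple(vector)
        # Reuse an existing element for an identical vector while it has room.
        for element in self.elements:
            if element.vector == vector and len(element.heaptids) < MAX_HEAPTIDS:
                element.heaptids.append(heaptid)
                return element
        # Otherwise store a new element (and, in a real index, its neighbor tuple).
        element = Element(vector, level, [heaptid])
        self.elements.append(element)
        return element


index = ToyIndex()
for row in range(3):
    index.insert([1.0, 2.0, 3.0], heaptid=(0, row + 1))
print(len(index.elements), index.elements[0].heaptids)
print(neighbor_slots(level=0, m=16), neighbor_slots(level=2, m=16))
```

Three rows with the same vector end up as one element carrying three row pointers, so the vector data and its neighbor list are stored only once.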

Related

DuckDB: Vector Similarity Search Extension

The vss extension in DuckDB enhances vector similarity search with HNSW indexing for ARRAY columns. Users can optimize queries with distance metrics but should be cautious due to limitations and experimental features.

Vectorlite: Fast Vector Search for SQLite

Vectorlite is a runtime-loadable extension for SQLite that enables fast vector search with hnswlib on Windows, macOS, and Linux. It supports SIMD acceleration, various distance types, and customizable HNSW parameters. For Python, installing with `pip install vectorlite-py` is suggested. The GitHub page offers examples, API references, and benchmarks.

Speeding up index creation in PostgreSQL

Indexes in PostgreSQL play a vital role in enhancing database performance. This article explores optimizing index creation on large datasets by adjusting parameters like max_wal_size and shared_buffers, emphasizing data sorting and types for efficiency.

Postgres stores data on disk – this one's a page turner

PostgreSQL stores data in a structured directory at /var/lib/postgresql/data, containing essential subdirectories and files for database operations, access control, statistics, and transaction management, aiding developers in data optimization.

MariaDB Introduces Open-Source Vector Preview

MariaDB 11.6 introduces a public preview of its open-source Vector search feature, utilizing the HNSW algorithm to support large language models, aiming to compete with MySQL and attract developers.

4 comments
By @jadbox - 5 months
Vector searching had strange quirks where searching for "cat" would return mostly a lot of paragraphs unrelated to the word. I was using 3072 length for OAI text-embedding-3-large. Each entry was roughly 1-2 paragraphs. For my recent project, I found that PGroonga was more reliable for full text document lookup (with some fuzzy matching support).
By @simedw - 5 months
Very interesting breakdown. OP, have you deep-dived into pgvectorscale as well?
By @mkesper - 5 months
I wanted to read this article. Gave up because of absolutely missing contrast. Please, if you publish something, use black (#000) for text and almost white for background and not darker grey on a lighter grey background.