July 3rd, 2024

Optimizing Large-Scale OpenStreetMap Data with SQLite

The article discusses optimizing large-scale OpenStreetMap data with SQLite. Converting OSMPBF to SQLite enhanced search functionalities. Indexing, full-text search, and compression improved query performance, despite some false positives.

Read original articleLink Icon
Optimizing Large-Scale OpenStreetMap Data with SQLite

The article discusses the optimization of large-scale OpenStreetMap (OSM) data using SQLite. The author converted a massive dataset from OSMPBF format to an SQLite database to enhance search functionalities. OSM data comprises nodes, ways, and relations, each with associated metadata. Initially, the SQLite database was large, prompting the need for optimization to improve query performance. The author explored indexing and full-text search techniques in SQLite to speed up searches. By compressing the SQLite file using Zstandard, the database size was significantly reduced, improving read performance. Despite some false positives in search results, the compressed database allowed for faster queries. Further optimization included reducing false positives in queries and enhancing search efficiency. The project showcases the iterative refinement process and the effectiveness of combining different technologies to address complex data optimization challenges.

Link Icon 13 comments
By @wcedmisten - 7 months
I recently discovered DuckDB's Read_OSM() function [1], which lets you query OSM PBF files directly.

For example, it's simple to count the cafes in North America in under 30s:

  SELECT COUNT(*) FROM st_readOSM('/home/wcedmisten/Downloads/north-america-latest.osm.pbf') WHERE tags['amenity'] = ['cafe'];
  ┌──────────────┐
  │ count_star() │
  │    int64     │
  ├──────────────┤
  │        57150 │
  └──────────────┘
  Run Time (s): real 24.643 user 379.067204 sys 3.696217

Unfortunately, I discovered there are still some bugs [2] that need to be ironed out, but it seems very promising for doing high performance queries with minimal effort.

[1]: https://duckdb.org/docs/extensions/spatial.html#st_readosm--...

[2]: https://github.com/duckdb/duckdb_spatial/issues/349

By @simonw - 7 months
I liked the trick used here for speeding up tag key/value queries using a FTS index:

    SELECT id
    FROM entries e
    JOIN search s ON s.rowid = e.id
    WHERE
    -- use FTS index to find subset of possible results
    search MATCH 'amenity cafe'
    -- use the subset to find exact matches
    AND tags->>'amenity' = 'cafe';
By @DonnyV - 7 months
If you convert those PBFs to Parquet files. You can then use Duckdb to search them with sub-sec response. Plus you get the added bonus of being able to host in an S3 type cloud storage.

https://duckdb.org/docs/extensions/httpfs/s3api

By @sdenton4 - 7 months
So, hype aside, what's the over/under on DuckDB vs Sqlite these days? I'm working on a Thing right now, has started with sqlite due to being a) good enough, b) stupendously optimized and nails hardened, and c) runs on your phone, your toaster, and your server.

What's DuckDB bringing to the table relative to sqlite, which seems like the boring-and-therefore-best choice?

By @zamazan4ik - 7 months
If you are interested in optimizing the project further, I can suggest you rebuilding SQLite with Profile-Guided Optimization (PGO). I collected as many as possible materials (including many related benchmarks) in my repo: https://github.com/zamazan4ik/awesome-pgo . Regarding SQLite and PGO, I have the following link: https://sqlite.org/forum/forumpost/19870fae957d8c1a
By @dzogchen - 7 months
> This highlights the importance of iterative refinement and the power of combining different technologies to solve problems.

This uninformative non-sentence sounds an awful lot like ChatGPT.

By @Scaevolus - 7 months
> CREATE INDEX entries_name ON entries(tags->>'name'); > However, this requires an index per tag, which won’t scale, especially for a dynamic list of tags.

That's not how indexes work at all. This will be fine.

By @mnmatin - 7 months
Sometime ago, I had to extract a large amount of data from OSM and found the process harder than it should have been. Similar to how you shrunk the size by 50% after removing unnecessary tags. Ended up creating a python package 'earth-osm' (https://github.com/pypsa-meets-earth/earth-osm) that makes things more intuitive. Always wanted to push the data into a database but never got around to that....
By @sitkack - 7 months
I went through a similar process when I converted wikipedia xml dumps into sqlite.

https://news.ycombinator.com/item?id=28013347

By @marklit - 7 months
The GPKG files https://github.com/marklit/osm_split produces are SQLite files under the hood and come with an R-Tree Spatial index so they'll load into QGIS and other GIS software quickly.
By @kopirgan - 7 months
This is brilliant