July 3rd, 2024

Optimizing Large-Scale OpenStreetMap Data with SQLite

The article discusses optimizing large-scale OpenStreetMap data with SQLite. Converting the OSM PBF data to a SQLite database enabled tag-based search. Indexing, full-text search, and Zstandard compression improved query performance, though the full-text approach introduced some false positives that had to be filtered out.


The article discusses the optimization of large-scale OpenStreetMap (OSM) data using SQLite. The author converted a massive dataset from the OSM PBF format into a SQLite database to enable better search. OSM data comprises nodes, ways, and relations, each with associated metadata. The initial SQLite database was very large, which prompted a series of optimizations to improve query performance. The author explored SQLite's indexing and full-text search features to speed up tag lookups, and compressed the database file with Zstandard, which significantly reduced its size and improved read performance. The full-text index produced some false positives, so later iterations filtered those out with exact tag comparisons while keeping queries fast. The project illustrates iterative refinement and the effectiveness of combining different techniques to tackle a large data-optimization problem.
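For a concrete picture, here is a minimal sketch of the kind of schema and per-tag expression index described above. The table and column names are inferred from the queries quoted in the comments below; the article's actual schema may differ, and other columns (coordinates, element type, etc.) are omitted.

    -- Illustrative schema: one row per OSM element, tags stored as JSON text.
    CREATE TABLE entries (
        id   INTEGER PRIMARY KEY,  -- OSM element id
        tags TEXT                  -- JSON object of the element's tags
    );

    -- A per-tag expression index (requires SQLite 3.38+ for the ->> operator).
    -- Fast for this one key, but a separate index is needed for every tag you
    -- want to query, which is what motivated the full-text search approach.
    CREATE INDEX entries_amenity ON entries(tags->>'amenity');

    -- Example lookup that can use the index above.
    SELECT id FROM entries WHERE tags->>'amenity' = 'cafe';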

13 comments
By @wcedmisten - 3 months
I recently discovered DuckDB's ST_ReadOSM() function [1], part of the spatial extension, which lets you query OSM PBF files directly.

For example, it's simple to count the cafes in North America in under 30s:

  SELECT COUNT(*) FROM st_readOSM('/home/wcedmisten/Downloads/north-america-latest.osm.pbf') WHERE tags['amenity'] = ['cafe'];
  ┌──────────────┐
  │ count_star() │
  │    int64     │
  ├──────────────┤
  │        57150 │
  └──────────────┘
  Run Time (s): real 24.643 user 379.067204 sys 3.696217

Unfortunately, I discovered there are still some bugs [2] that need to be ironed out, but it seems very promising for doing high performance queries with minimal effort.

[1]: https://duckdb.org/docs/extensions/spatial.html#st_readosm--...

[2]: https://github.com/duckdb/duckdb_spatial/issues/349

By @simonw - 3 months
I liked the trick used here for speeding up tag key/value queries using a FTS index:

    SELECT id
    FROM entries e
    JOIN search s ON s.rowid = e.id
    WHERE
    -- use FTS index to find subset of possible results
    search MATCH 'amenity cafe'
    -- use the subset to find exact matches
    AND tags->>'amenity' = 'cafe';
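The query assumes a full-text table named search whose rowids line up with entries.id. Here is a minimal sketch of how such an index might be built (FTS5 in this sketch, with tags stored as a JSON text column); the article's exact setup may differ.

    -- Contentless FTS5 table: only the inverted index is stored, and we join
    -- back to entries for the actual data.
    CREATE VIRTUAL TABLE search USING fts5(tags, content='');

    -- Flatten each entry's tag keys and values into one text blob so that
    -- MATCH 'amenity cafe' finds rows containing both tokens.
    INSERT INTO search(rowid, tags)
    SELECT e.id,
           (SELECT group_concat(j.key || ' ' || j.value, ' ')
            FROM json_each(e.tags) AS j)
    FROM entries AS e;

Because MATCH only checks that both tokens appear somewhere in the flattened text, the exact tags->>'amenity' = 'cafe' comparison in the outer query is still needed to weed out false positives.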
By @DonnyV - 3 months
If you convert those PBFs to Parquet files, you can then use DuckDB to search them with sub-second response times. Plus, you get the added bonus of being able to host them in S3-style cloud storage.

https://duckdb.org/docs/extensions/httpfs/s3api
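A rough sketch of what that workflow could look like, assuming the PBF data has already been converted to Parquet with the amenity tag pulled out into its own column; the bucket path and column names here are made up for illustration.

    -- Query Parquet files directly from S3 with DuckDB's httpfs extension.
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_region = 'us-east-1';   -- credentials/region configured as needed

    SELECT count(*)
    FROM read_parquet('s3://example-bucket/osm/*.parquet')
    WHERE amenity = 'cafe';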

By @sdenton4 - 3 months
So, hype aside, what's the over/under on DuckDB vs SQLite these days? I'm working on a Thing right now and started with SQLite because it's a) good enough, b) stupendously optimized and hard as nails, and c) runs on your phone, your toaster, and your server.

What's DuckDB bringing to the table relative to sqlite, which seems like the boring-and-therefore-best choice?

By @zamazan4ik - 3 months
If you are interested in optimizing the project further, I suggest rebuilding SQLite with Profile-Guided Optimization (PGO). I have collected as many materials as possible (including many related benchmarks) in my repo: https://github.com/zamazan4ik/awesome-pgo . Regarding SQLite and PGO specifically, see: https://sqlite.org/forum/forumpost/19870fae957d8c1a
By @dzogchen - 3 months
> This highlights the importance of iterative refinement and the power of combining different technologies to solve problems.

This uninformative non-sentence sounds an awful lot like ChatGPT.

By @Scaevolus - 3 months
> CREATE INDEX entries_name ON entries(tags->>'name');

> However, this requires an index per tag, which won’t scale, especially for a dynamic list of tags.

That's not how indexes work at all. This will be fine.

By @mnmatin - 3 months
Some time ago, I had to extract a large amount of data from OSM and found the process harder than it should have been, similar to how you had to shrink the size by 50% by removing unnecessary tags. I ended up creating a Python package, 'earth-osm' (https://github.com/pypsa-meets-earth/earth-osm), that makes things more intuitive. I always wanted to push the data into a database but never got around to that....
By @sitkack - 3 months
I went through a similar process when I converted Wikipedia XML dumps into SQLite.

https://news.ycombinator.com/item?id=28013347

By @marklit - 3 months
The GPKG files that https://github.com/marklit/osm_split produces are SQLite files under the hood and come with an R-Tree spatial index, so they'll load into QGIS and other GIS software quickly.
By @kopirgan - 3 months
This is brilliant