ArcticDB: Why a Hedge Fund Built Its Own Database
Man Group developed ArcticDB to enhance performance in managing high-frequency, time-series data, addressing scaling issues with MongoDB. The proprietary database supports quantitative trading and reflects a trend in custom financial solutions.
Man Group, a major alternative asset manager, developed its own database technology, ArcticDB, to handle high-frequency, time-series data. The decision stemmed from challenges with existing solutions, particularly MongoDB, which became a scaling obstacle as the firm managed over $160 billion in assets and executed trades worth $6 trillion annually. ArcticDB, initially built in Python and backed by MongoDB, was later rewritten in C++ to improve performance and scalability. This transition allowed the database to connect directly to object storage systems like S3, significantly improving efficiency. The motivation for building a proprietary database was to serve the specific requirements of quantitative trading and data science workflows, which existing databases could not adequately support. ArcticDB is now integral to Man Group's operations, supporting market data analysis and risk management across various financial sectors. The development reflects a broader trend in finance, where firms build custom solutions to meet specialized data-handling needs.
- Man Group created ArcticDB to improve performance in managing high-frequency, time-series data.
- The database evolved from an initial Python version backed by MongoDB due to scaling limitations.
- ArcticDB connects directly to object storage for enhanced efficiency and scalability.
- The proprietary database addresses unique requirements in quantitative trading and data science.
- Custom database solutions are becoming common in the finance industry to meet specialized needs.
Related
The Ultimate Database Platform
AverageDB, a database platform for developers, raised $50 million in funding. It offers speed, efficiency, serverless architecture, real-time data access, and customizable pricing. The platform prioritizes data privacy and caters to diverse user needs.
Artie (YC S23) Is Hiring
Artie, a San Francisco startup, seeks a Founding Engineer to shape the product's future. Responsibilities include customer interaction, database work, and utilizing real-time data streaming technologies like Kafka.
Is an All-in-One Database the Future?
Specialized databases are emerging to tackle complex data challenges, leading to intricate infrastructures. A universal, all-in-one database remains unfulfilled due to optimization issues and unique challenges of different database types.
The Future of Kdb+
The article examines kdb+'s future in financial services, noting competition from newer technologies and suggesting KX should enhance its product and consider strategic changes to maintain relevance.
pg_duckdb: Splicing Duck and Elephant DNA
MotherDuck launched pg_duckdb, an open-source extension integrating DuckDB with Postgres to enhance analytical capabilities while maintaining transactional efficiency, supported by a consortium of companies and community contributions.
- Users appreciate ArcticDB's open-source nature and its suitability for handling time-series data, with some sharing personal experiences using it for projects.
- There are comparisons made with other data storage solutions like TileDB, Delta Lake, and Apache Parquet, indicating a curiosity about ArcticDB's unique advantages.
- Concerns are raised about the database's scalability and whether it can handle various timestamp resolutions effectively.
- Some commenters express skepticism about the necessity of building a custom database when existing solutions could suffice, hinting at a broader industry trend of creating specialized tools.
- Discussion includes the challenges of using traditional data analysis tools like pandas and the need for more efficient alternatives in quantitative trading environments.
These are really smart people solving problems they have, but many companies don't have buckets of cash to hire really smart people to solve those problems.
Also, the questions after the presentation pointed out that the data isn't always analyzed in their database, so it's more like a storage system than a database.
>Participant 1: What's the optimization happening on the pandas DataFrames, which we obviously know are not very good at scaling up to billions of rows? How are you doing that? On the pandas DataFrames, what kind of optimizations are you running under the hood? Are you doing some Spark?
>Munro: The general pattern we have internally and the users have, is that your returning pandas DataFrames are usable. They're fitting in memory. You're doing the querying, so it's like, limit your results to that. Then, once people have got their DataFrame back, they might choose another technology like Polars, DuckDB to do their analytics, depending on if they don't like pandas or they think it's too slow.
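The pattern Munro describes can be sketched with plain pandas: do a bounded query so the result fits in memory, then hand the resulting DataFrame to whatever analytics tool you like. This is only an illustration of the workflow; the date range, column names, and data here are invented, and in ArcticDB the filtering would happen server-side before the DataFrame is materialised.

```python
import pandas as pd

# Hypothetical stand-in for a bounded database read. In ArcticDB the
# date-range filter would be pushed down to storage; here we just slice.
idx = pd.date_range("2024-01-01", periods=1000, freq="min")
frame = pd.DataFrame({"price": range(1000)}, index=idx)

# Step 1: limit the query so the result fits comfortably in memory.
window = frame.loc["2024-01-01 00:00":"2024-01-01 00:59"]

# Step 2: analyse the in-memory DataFrame with your tool of choice
# (pandas here; Polars or DuckDB would accept it just as easily).
summary = window["price"].agg(["count", "mean"])
print(int(summary["count"]), summary["mean"])  # 60 rows, mean 29.5
```

The point of the pattern is that the database's job ends once a manageable DataFrame is returned; the analytics layer is deliberately pluggable.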
https://github.com/man-group/arcticDB
I used to work at man, so take this with a grain of salt, but I really liked this and have spun it up at home for side projects over the years.
I’m not aware of other specialized storage options for dataframes, but would be curious if anyone knows of any.
> How does ArcticDB differ from Apache Parquet?
> Both ArcticDB and Parquet enable the storage of columnar data without requiring additional infrastructure.
> ArcticDB however uses a custom storage format that means it offers the following functionality over Parquet:
> Versioned modifications ("time travel") - ArcticDB is bitemporal.
> Timeseries indexes. ArcticDB is a timeseries database and as such is optimised for slicing and dicing timeseries data containing billions of rows.
> Data discovery - ArcticDB is built for teams. Data is structured into libraries and symbols rather than raw filepaths.
> Support for streaming data. ArcticDB is a fully functional streaming/tick database, enabling the storage of both batch and streaming data.
> Support for "dynamic schemas" - ArcticDB supports datasets with changing schemas (column sets) over time.
> Support for automatic data deduplication.
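The "time travel" item in that list is easiest to see with a toy sketch: every write appends an immutable version, and reads default to the latest but can pin an older one. This is a conceptual illustration in plain Python, not ArcticDB's actual storage format, and every name below is invented.

```python
# Toy versioned store illustrating "time travel" reads. ArcticDB's real
# format is columnar and chunked in object storage; this only shows the
# versioning idea, with invented names throughout.
class VersionedStore:
    def __init__(self):
        self._versions = {}  # symbol -> list of immutable payloads

    def write(self, symbol, data):
        # Writes never mutate in place; each one appends a new version.
        self._versions.setdefault(symbol, []).append(data)
        return len(self._versions[symbol]) - 1  # version number written

    def read(self, symbol, as_of=None):
        # Default to the latest version; as_of pins a historical one.
        versions = self._versions[symbol]
        return versions[-1] if as_of is None else versions[as_of]

store = VersionedStore()
store.write("EURUSD", {"close": 1.08})   # version 0
store.write("EURUSD", {"close": 1.09})   # version 1
print(store.read("EURUSD"))              # latest
print(store.read("EURUSD", as_of=0))     # time travel
```

Because old versions are immutable, a backtest pinned to a version number is reproducible even after later writes, which is the practical payoff of bitemporality.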
The other answer I was looking for was why not kdb+, since this is a hedge fund.
ArcticDB: A high-performance, serverless Pandas DataFrame database - https://news.ycombinator.com/item?id=35198131 - March 2023 (1 comment)
Introducing ArcticDB: Powering data science at Man Group - https://news.ycombinator.com/item?id=35181870 - March 2023 (1 comment)
https://www.twosigma.com/articles/smooth-storage-a-distribut...
I'll put in a TLDR as the presentation is quite long. The other thing I'd like to say is that QCon London impressed me: the organisers spent time ensuring a good quality of presentations, and the other talks I saw were great. Many conferences I've been to recently are just happy to get someone, or go with well-known quantities. I first attended QCon London early in my career, so it was interesting coming back after more than a decade to present.
TLDR:
Why did we build our own database? In effort terms, successful quantitative trading is more about good ideas well executed than it is about production trading technology (apart from perhaps HFT). We needed something that helped the quants be the most productive with data.
We needed something that was:
- Easy to use (I mean really easy for beginner/moderate programmers). We talk about day 1 productivity for new starters. Python is a tool for quants, not a career.
- Cost effective to run (no large DB infra, easy to maintain, cheap storage, low licensing)
- Performant (traditional SQL DBs don't compare here; we're in the Parquet, ClickHouse, kdb+, etc. space)
- Scalable (large data-science jobs 10K+ cores, on-demand)
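The cost and scalability points in that list follow from skipping a database server entirely and writing chunks straight to object storage, so thousands of workers can read independently. Below is a minimal sketch of that idea, with a plain dict standing in for an S3 bucket; the key layout and serialisation are invented for illustration and are much simpler than ArcticDB's real format.

```python
import io
import pandas as pd

# A plain dict stands in for an S3 bucket: key -> serialised bytes.
# The "library/symbol/chunk_N" key layout is invented for this sketch.
object_store = {}

def write_chunked(library, symbol, frame, rows_per_chunk=2):
    # Split the DataFrame into row slices and store each as its own
    # object, so readers can later fetch only the chunks they need.
    for i, start in enumerate(range(0, len(frame), rows_per_chunk)):
        chunk = frame.iloc[start:start + rows_per_chunk]
        object_store[f"{library}/{symbol}/chunk_{i}"] = chunk.to_json()

def read_all(library, symbol):
    # Fetch every chunk under the symbol's prefix and stitch them back.
    prefix = f"{library}/{symbol}/"
    keys = sorted(k for k in object_store if k.startswith(prefix))
    return pd.concat(pd.read_json(io.StringIO(object_store[k])) for k in keys)

prices = pd.DataFrame({"close": [1.0, 1.1, 1.2, 1.3, 1.4]})
write_chunked("fx", "EURUSD", prices)
round_trip = read_all("fx", "EURUSD")
```

With no central server in the path, scaling read throughput is the object store's problem rather than the database's, which is what makes 10K-core on-demand jobs feasible.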
A much shorter 3 min intro from PyQuantNews: https://www.youtube.com/watch?v=5_AjD7aVEEM
GitHub repo (Source-available/BSL): https://github.com/man-group/ArcticDB
"How we switched from a custom database to Postgres".
Apparently there is also a class of problems that outgrow pandas. And instead of the business side switching to more suitable tools, some really smart people are hired to build crutches around pandas.
Oh well, they probably had fun doing it. Maybe they get to work on nogil python next