August 21st, 2024

ArcticDB: Why a Hedge Fund Built Its Own Database

Man Group developed ArcticDB to enhance performance in managing high-frequency, time-series data, addressing scaling issues with MongoDB. The proprietary database supports quantitative trading and reflects a trend in custom financial solutions.

Man Group, a significant alternative asset manager, developed its own database technology, ArcticDB, to address specific needs in handling high-frequency, time-series data. The decision stemmed from challenges faced with existing solutions, particularly with MongoDB, which became a scaling obstacle as the firm managed over $160 billion in assets and executed trades worth $6 trillion annually. ArcticDB, initially built in Python and backed by MongoDB, was later rewritten in C++ to enhance performance and scalability. This transition allowed the database to connect directly to object storage systems like S3, significantly improving efficiency. The motivation behind creating a proprietary database was to cater to the unique requirements of quantitative trading and data science workflows, which existing databases could not adequately support. ArcticDB is now integral to Man Group's operations, facilitating market data analysis and risk management across various financial sectors. The development reflects a broader trend in the finance industry, where firms often build custom solutions to meet specialized data handling needs.

- Man Group created ArcticDB to improve performance in managing high-frequency, time-series data.

- The database began as a Python version backed by MongoDB and was rewritten in C++ after MongoDB became a scaling obstacle.

- ArcticDB connects directly to object storage for enhanced efficiency and scalability.

- The proprietary database addresses unique requirements in quantitative trading and data science.

- Custom database solutions are becoming common in the finance industry to meet specialized needs.

AI: What people are saying
The comments on the article about ArcticDB reveal several key insights and discussions surrounding the database's development and functionality.
  • Users appreciate ArcticDB's open-source nature and its suitability for handling time-series data, with some sharing personal experiences using it for projects.
  • There are comparisons made with other data storage solutions like TileDB, Delta Lake, and Apache Parquet, indicating a curiosity about ArcticDB's unique advantages.
  • Concerns are raised about the database's scalability and whether it can handle various timestamp resolutions effectively.
  • Some commenters express skepticism about the necessity of building a custom database when existing solutions could suffice, hinting at a broader industry trend of creating specialized tools.
  • Discussion includes the challenges of using traditional data analysis tools like pandas and the need for more efficient alternatives in quantitative trading environments.
12 comments
By @stackskipton - 8 months
Read the presentation. The answer was what I expected: we had a unique problem, and because we make oil drums' worth of cash, dipping in a bucket and taking that cash to solve the problem was an easy justification.

These are really smart people solving problems they have but many companies don't have buckets of cash to hire really smart people to solve those problems.

Also, the questions after the presentation pointed out that the data isn't always analyzed in their database, so it's more like a storage system than a database.

>Participant 1: What's the optimization happening on the pandas DataFrames, which we obviously know are not very good at scaling up to billions of rows? How are you doing that? On the pandas DataFrames, what kind of optimizations are you running under the hood? Are you doing some Spark?

>Munro: The general pattern we have internally and the users have, is that your returning pandas DataFrames are usable. They're fitting in memory. You're doing the querying, so it's like, limit your results to that. Then, once people have got their DataFrame back, they might choose another technology like Polars, DuckDB to do their analytics, depending on if they don't like pandas or they think it's too slow.
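
The pattern Munro describes, limiting the query so that the returned frame fits in memory and only then handing it to an analytics tool, can be sketched roughly as follows. This is a toy illustration using only the Python standard library, not ArcticDB's API: the `read_range` helper, the sample index, and the column names are all hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

# Hypothetical in-memory "symbol": a sorted time index plus named columns.
index = [datetime(2024, 1, day) for day in range(1, 11)]
columns = {
    "price": [100.0 + i for i in range(10)],
    "volume": [1000 * (i + 1) for i in range(10)],
}


def read_range(index, columns, start, end, wanted):
    """Return only rows with start <= timestamp <= end, and only the wanted columns.

    Because the index is sorted, binary search finds the slice bounds
    without scanning every row.
    """
    lo = bisect_left(index, start)
    hi = bisect_right(index, end)
    result = {"index": index[lo:hi]}
    for name in wanted:
        result[name] = columns[name][lo:hi]
    return result


# Ask for three days of one column instead of the whole data set.
subset = read_range(
    index, columns, datetime(2024, 1, 3), datetime(2024, 1, 5), ["price"]
)
print(len(subset["index"]), subset["price"])
```

The point of the sketch is only that the narrowing happens at read time; whatever comes back is small enough to pass to pandas, Polars, or DuckDB for the actual analysis.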

By @dnadler - 8 months
If it wasn’t clear from the article, this is open source and available on Man’s GitHub page:

https://github.com/man-group/arcticDB

I used to work at Man, so take this with a grain of salt, but I really liked this and have spun it up at home for side projects over the years.

I’m not aware of other specialized storage options for dataframes, but would be curious if anyone knows of any.

By @faizshah - 8 months
I still didn’t get why they built this. There’s a better explanation of the feature set in the FAQ comparison with Parquet: https://docs.arcticdb.io/latest/faq/

> How does ArcticDB differ from Apache Parquet?

> Both ArcticDB and Parquet enable the storage of columnar data without requiring additional infrastructure.

> ArcticDB however uses a custom storage format that means it offers the following functionality over Parquet:

> - Versioned modifications ("time travel") - ArcticDB is bitemporal.
> - Timeseries indexes. ArcticDB is a timeseries database and as such is optimised for slicing and dicing timeseries data containing billions of rows.
> - Data discovery - ArcticDB is built for teams. Data is structured into libraries and symbols rather than raw filepaths.
> - Support for streaming data. ArcticDB is a fully functional streaming/tick database, enabling the storage of both batch and streaming data.
> - Support for "dynamic schemas" - ArcticDB supports datasets with changing schemas (column sets) over time.
> - Support for automatic data deduplication.
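
To make the "versioned modifications" and "libraries and symbols" items in the FAQ concrete, here is a minimal sketch of the idea: every write to a named symbol creates a new version, and a read can ask for an older one. It is a toy in-memory model using only the Python standard library; the class `ToyVersionedLibrary`, its methods, and the symbol name are hypothetical and are not ArcticDB's actual API or storage format.

```python
class ToyVersionedLibrary:
    """Keeps every written version of each symbol; nothing is overwritten."""

    def __init__(self):
        self._versions = {}  # symbol name -> list of snapshots, oldest first

    def write(self, symbol, rows):
        """Store a new snapshot and return its version number."""
        snapshots = self._versions.setdefault(symbol, [])
        snapshots.append(list(rows))  # copy so later caller edits don't leak in
        return len(snapshots) - 1

    def read(self, symbol, as_of=None):
        """Return the latest snapshot, or the one at version `as_of`."""
        snapshots = self._versions[symbol]
        version = len(snapshots) - 1 if as_of is None else as_of
        return list(snapshots[version])

    def list_symbols(self):
        """Data is discovered by symbol name rather than by file path."""
        return sorted(self._versions)


lib = ToyVersionedLibrary()
v0 = lib.write("demo_prices", [1.0, 2.0])
v1 = lib.write("demo_prices", [1.0, 2.0, 3.0])

latest = lib.read("demo_prices")             # newest version
original = lib.read("demo_prices", as_of=v0)  # "time travel" to the first write
print(lib.list_symbols(), latest, original)
```

A real system would persist the snapshots and share unchanged data between versions rather than copying it; the sketch only shows the read-by-version behaviour.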

The other answer I was looking for was why not kdb since this is a hedge fund.

By @dang - 8 months
Related:

ArcticDB: A high-performance, serverless Pandas DataFrame database - https://news.ycombinator.com/item?id=35198131 - March 2023 (1 comment)

Introducing ArcticDB: Powering data science at Man Group - https://news.ycombinator.com/item?id=35181870 - March 2023 (1 comment)

By @chirau - 8 months
Two Sigma did a similar thing a few years back. It's called Smooth Storage.

https://www.twosigma.com/articles/smooth-storage-a-distribut...

By @jjmunro - 8 months
Hi. I'm the presenter. Thanks for the interest. Opinions here are my own.

I'll put in a TLDR as the presentation is quite long. The other thing I'd like to say was that QCon London impressed me, the organisers spent time ensuring a good quality of presentation. The other talks that I saw were great. Many conferences I've been to recently are just happy to get someone, or can choose and go with well known quantities. I first attended QCon London early in my career, so it was interesting coming back after over a decade to present.

TLDR:

Why did we build our own database? In effort terms, successful quantitative trading is more about good ideas well executed than it is about production trading technology (apart from perhaps HFT). We needed something that helped the quants be the most productive with data.

We needed something that was:

- Easy to use (I mean really easy for beginner/moderate programmers). We talk about day 1 productivity for new starters. Python is a tool for Quants not a career.

- Cost effective to run (no large DB infra, easy to maintain, cheap storage, low licensing)

- Performant (traditional SQL DBs don't compare here, we're in the Parquet, ClickHouse, kdb, etc. space)

- Scalable (large data-science jobs 10K+ cores, on-demand)

A much shorter 3 min intro from PyQuantNews: https://www.youtube.com/watch?v=5_AjD7aVEEM

GitHub repo (Source-available/BSL): https://github.com/man-group/ArcticDB

By @andrewstuart - 8 months
Blog post from same company in two years:

"How we switched from a custom database to Postgres".

By @bdjsiqoocwk - 8 months
Isn't it constrained to minutely timestamps or something like that?

By @jmakov - 8 months
Is there any reason to use that instead of Delta Lake?

By @quickvi - 8 months
Does anyone know how it compares to TileDB? Seems like TileDB is just a better ArcticDB.

By @tda - 8 months
I know there are tons of problems that are solved in Excel when they really shouldn't be. Instead of getting the expert business analyst to use a better tool (like pandas), money is spent to "fix" Excel.

Apparently there is also a class of problems that outgrow pandas. And instead of the business side switching to more suitable tools, some really smart people are hired to build crutches around pandas.

Oh well, they probably had fun doing it. Maybe they get to work on nogil Python next.