October 14th, 2024

A FLOSS platform for data analysis pipelines that you probably haven't heard of

Arvados is an open-source platform for managing large datasets. It features Keep for storage and Crunch for workflow orchestration, and it emphasizes data security. Users can access it via the web, the command line, or an API.

Arvados is an open-source platform designed for managing and processing large volumes of data, from terabytes to petabytes. Its storage layer is Keep, a content-addressable storage system built for high reliability and throughput. Keep lets users create collections of data without reorganizing or duplicating the underlying files, and it runs on top of various filesystems and object stores. The platform also includes Crunch, an orchestration system that runs Common Workflow Language (CWL) workflows, maintaining data provenance and reproducibility while optimizing costs in cloud environments.

Arvados emphasizes security and compliance with data protection regulations, offering access tokens, data encryption, and integration with external authentication systems such as Active Directory and Google accounts. Users can interact with Arvados through a web application called Workbench, through a command line interface, or through a RESTful API with SDKs available for multiple programming languages. These options make it easier to integrate Arvados with existing infrastructure and to query, browse, and visualize data.
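To make the content-addressable idea behind Keep more concrete, here is a minimal Python sketch. It is not Arvados code and does not use the Arvados SDK: `ToyBlockStore` and `make_collection` are hypothetical names, the store is an in-memory dictionary, and SHA-256 is chosen purely for illustration. It only shows why a collection can be built from existing data without copying it: blocks are stored under a hash of their content, and a collection is just a mapping from file names to those hashes.

```python
import hashlib


class ToyBlockStore:
    """Hypothetical in-memory store: each block is keyed by the hash of its bytes."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The locator is derived from the content, so identical data
        # always maps to the same key and is stored only once.
        locator = hashlib.sha256(data).hexdigest()
        self._blocks.setdefault(locator, data)
        return locator

    def get(self, locator: str) -> bytes:
        data = self._blocks[locator]
        # Re-hashing on read detects corruption of the stored block.
        if hashlib.sha256(data).hexdigest() != locator:
            raise ValueError("block content does not match its locator")
        return data

    def block_count(self) -> int:
        return len(self._blocks)


def make_collection(store: ToyBlockStore, files: dict) -> dict:
    """Store each file's bytes and return a manifest of name -> locator."""
    return {name: store.put(content) for name, content in files.items()}


store = ToyBlockStore()
run_a = make_collection(store, {"sample1.fastq": b"ACGT", "sample2.fastq": b"TTGA"})

# A second collection that reuses and renames one file is only a new manifest;
# no block is copied or moved.
subset = {"renamed.fastq": run_a["sample1.fastq"]}

print(store.get(subset["renamed.fastq"]))
print(store.block_count())
```

In this toy version both collections point at the same stored block, so creating `subset` costs only the size of the manifest. A real system would add replication, permissions, and persistent storage on top of this idea.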

- Arvados is an open-source platform for managing large datasets.

- Key components include Keep for storage and Crunch for workflow orchestration.

- The platform ensures data security and compliance with regulations.

- Users can access Arvados through a web application, command line, or API.

- SDKs are available for various programming languages to facilitate integration.

2 comments
By @moandcompany - 7 months
It might be useful to mention that Arvados appears to be meant for biomedical data. It doesn't appear to point this out on the homepage, and you have to read the About page for context:

"Arvados is a modern open source platform for managing and processing large biomedical data. By combining robust data and workflow management capabilities in a single platform, Arvados can organize and analyze petabytes of data and run reproducible and versioned computational workflows."

By @tetron - 7 months
I don't think this has been discussed on Hacker News before, but I wonder if people have any opinions about Arvados?