August 15th, 2024

Show HN: Denormalized – Embeddable Stream Processing in Rust and DataFusion

Denormalized is a developing stream processing engine based on Apache DataFusion, supporting Kafka. Users can start with Docker and Rust/Cargo, with future features planned for enhanced functionality.

Denormalized is a fast embeddable stream processing engine built on Apache DataFusion, aimed at real-time stream processing with support for Kafka as both a source and sink. The project is currently in development, and the team is seeking design partners to collaborate on specific use cases. Users can engage with the developers through GitHub issues or email. To get started, users need Docker and Rust/Cargo installed. The quickstart guide includes instructions for running Kafka in Docker, emitting sample data, and performing simple streaming aggregations. Additional examples, such as a Kafka ridesharing scenario, are also provided. The roadmap outlines completed features like stream aggregation and joins, with future plans for checkpointing, session windows, a stateful UDF API, and integrations with DuckDB, PostgreSQL, Python, and TypeScript, along with a user interface. The project is maintained by a team in San Francisco, and inquiries can be directed to hello@denormalized.io or through GitHub.

- Denormalized is a stream processing engine built on Apache DataFusion.

- It supports Kafka for real-time data processing and is currently in development.

- Users can start using it with Docker and Rust/Cargo.

- Future features include checkpointing, session windows, and various integrations.

- The project team is open to collaboration and inquiries via GitHub or email.
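For readers unfamiliar with the term, the "simple streaming aggregations" mentioned in the quickstart can be pictured as grouping timestamped events into fixed-size (tumbling) windows and keeping running statistics per key. The sketch below is plain standard-library Rust and does not use Denormalized's or DataFusion's API; the `Reading` and `Aggregate` types, their field names, and the `tumbling_aggregate` function are all hypothetical names chosen for illustration. It also processes a finished slice, whereas a real engine would update windows incrementally as Kafka messages arrive and decide when a window is complete.

```rust
use std::collections::BTreeMap;

/// One sensor reading. Field names are hypothetical, not Denormalized's schema.
#[derive(Debug, Clone, PartialEq)]
pub struct Reading {
    pub sensor: String,
    pub timestamp_ms: u64,
    pub value: f64,
}

/// Running statistics for one (window, sensor) pair.
#[derive(Debug, Clone, PartialEq)]
pub struct Aggregate {
    pub count: u64,
    pub min: f64,
    pub max: f64,
    pub sum: f64,
}

impl Aggregate {
    /// Starts an aggregate from the first value seen in a window.
    fn new(value: f64) -> Self {
        Aggregate { count: 1, min: value, max: value, sum: value }
    }

    /// Folds one more value into the running statistics.
    fn update(&mut self, value: f64) {
        self.count += 1;
        self.min = self.min.min(value);
        self.max = self.max.max(value);
        self.sum += value;
    }

    /// Mean of all values folded in so far.
    pub fn mean(&self) -> f64 {
        self.sum / self.count as f64
    }
}

/// Assigns each reading to the tumbling window that contains its timestamp
/// and aggregates per (window start, sensor).
pub fn tumbling_aggregate(
    readings: &[Reading],
    window_ms: u64,
) -> BTreeMap<(u64, String), Aggregate> {
    assert!(window_ms > 0, "window size must be positive");
    let mut out: BTreeMap<(u64, String), Aggregate> = BTreeMap::new();
    for r in readings {
        // Round the timestamp down to the start of its window.
        let window_start = r.timestamp_ms - r.timestamp_ms % window_ms;
        out.entry((window_start, r.sensor.clone()))
            .and_modify(|agg| agg.update(r.value))
            .or_insert_with(|| Aggregate::new(r.value));
    }
    out
}
```

Calling `tumbling_aggregate(&readings, 1000)` returns one entry per sensor per one-second window, ordered by window start. Using a `BTreeMap` keeps the output sorted by time, which is convenient for printing; a hash map would work equally well if ordering does not matter.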

The Ultimate Database Platform

AverageDB, a database platform for developers, raised $50 million in funding. It offers speed, efficiency, serverless architecture, real-time data access, and customizable pricing. The platform prioritizes data privacy and caters to diverse user needs.

DuckDB Meets Postgres

Organizations shift historical Postgres data to S3 with Apache Iceberg, enhancing query capabilities. ParadeDB integrates Iceberg with S3 and Google Cloud Storage, replacing DataFusion with DuckDB for improved analytics in pg_lakehouse.

Show HN: I made a TUI for kafka (kaskade)

The GitHub repository "Kaskade" offers a text user interface for Apache Kafka, providing admin features and consumer functionalities. It includes installation guidelines, configuration examples, development guidance, and screenshots. Visit [sauljabin/kaskade] for more details.

Show HN: Pg_replicate – Build Postgres replication applications in Rust

pg_replicate is a Rust crate for PostgreSQL data replication, supporting logical streaming replication. It offers easy integration, a quickstart guide, and plans for future enhancements and additional data sinks.

Launch HN: Synnax (YC S24) – Unified hardware control and sensor data streaming

Synnax is a platform that connects sensors and actuators for real-time telemetry and data analysis, featuring a scalable time series database, supporting multiple programming languages, and offering free usage for up to 50 channels.

AI: What people are saying

The comments on the article about Denormalized reflect a mix of interest and inquiries regarding the new stream processing engine.

- Users express excitement about the project's potential and ease of setup.

- Several commenters inquire about specific features, such as support for OLAP use cases and pluggable data sources.

- There is curiosity about how Denormalized compares to existing solutions like Arroyo and Flink.

- Founders and developers from related projects show interest in collaboration and integration with Denormalized.

- Many users are eager for future features, including a Python SDK and TypeScript bindings.

16 comments

By @dman - 9 months

This looks super interesting. I built https://github.com/finos/perspective in a past life but have been out of the streaming analytics game for some time. Nice to see single machine efficiency be a focus, will give this a try and post feedback on github.

By @emgeee - 9 months

Other founder here -- we've been working on this now for several months and have had a lot of fun building on top of arrow and datafusion

By @theLiminator - 9 months

Are you going to support OLAP use cases as well? I haven't yet found a really nice hybrid batch/streaming query engine with dataframe support.

Ideally, you'd support an api similar to Polars (which I have found to be the nicest thus far).

It'd also be important/useful to support Python udfs (think numpy/jax/etc.).

It'd be very cool if you could collaborate with or even tap into the polars frontend. If you could execute polars logical plans but with a streaming source, that would be huge.

By @j-pb - 9 months

I'd be curious to know what your thoughts on differential/timely dataflow are. Superficially it seems that it might be possible to integrate the existing Rust infrastructure from those libraries with DataFusion and Arrow, which could give you quite a few operators for free, and provide your users with the very nice incremental query/streaming-as-view-maintenance model.

By @ethegwo - 9 months

Neat, founder of https://tonbo.io/ here. I am excited to see someone bring stream processing to DataFusion. We are working on an Arrow-native embedded db and plan to support DataFusion in the next release, and we're interested in building the streaming feature on Denormalized.

By @shrisukhani - 9 months

Interesting. What use cases are you guys targeting with this?

By @stereosky - 9 months

Congratulations on launching your project! We spoke back in March at a Kafka Summit London social meetup and talked all things Python and Kafka (I work on https://github.com/quixio/quix-streams). Always great to see a new stream processing project tackle a new segment

By @eXpl0it3r - 9 months

For someone not deep in the topic, what is a "Streaming Processing Engine"?

All the descriptions for Denormalized use the term, so if you don't know it, it's kind of impossible to understand what Denormalized is or what it is trying to solve.

By @nonlogical - 9 months

This looks totally awesome! Easy to set up, memory-efficient, streaming, real-time data aggregation, compilable to a single self-contained binary: that is a dream come true.

Bookmarked for future projects!

By @ztratar - 9 months

Will be excited to see the typescript bindings once out. We may be able to use this to handle some of our workloads at Embra.

Will reach out! Congrats on the ship.

By @drawnwren - 9 months

What differentiates you from e.g. Arroyo and Fluvio?

By @franciscojarceo - 9 months

Can't wait for the Python SDK!

By @lhnz - 9 months

Do you have plans to make the data sources pluggable instead of being Kafka specific?

By @akshay2881 - 9 months

Nice! How feature complete is this with current industry standards like Flink?

By @rNULLED - 9 months

Looks cool! I’ll try it out for my ambitious project :)
